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Abstract 

Large-scale, parallel clusters composed of commodity processors are increas- 
ingly available, enabling the use of vast processing capabilities and dis- 
tributed RAM to solve hard search problems. We investigate Hash-Distributed 
A* (HDA*), a simple approach to parallel best-first search that asynchronously 
distributes and schedules work among processors based on a hash function 
of the search state. We use this approach to parallelize the A* algorithm 
in an optimal sequential version of the Fast Downward planner, as well as a 
24-puzzle solver. The scaling behavior of HDA* is evaluated experimentally 
on a shared memory, multicore machine with 8 cores, a cluster of commodity 
machines using up to 64 cores, and large-scale high-performance clusters, us- 
ing up to 2400 processors. We show that this approach scales well, allowing 
the effective utilization of large amounts of distributed memory to optimally 
solve problems which require terabytes of RAM. We also compare HDA* to 
Transposition-table Driven Scheduling (TDS), a hash-based parallelization 
of IDA*, and show that, in planning, HDA* significantly outperforms TDS. 
A simple hybrid which combines HDA* and TDS to exploit strengths of both 
algorithms is proposed and evaluated. 
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1. Introduction 

Parallel search is an important research area for two reasons. First, many 
search problems, including planning instances, continue to be difficult for se- 
quential algorithms. Parallel search on state-of-the-art, parallel clusters has 
the potential to provide both the memory and the CPU resources required to 
solve challenging problem instances. Second, while multiprocessors were pre- 
viously expensive and rare, multicore machines are now ubiquitous. Future 
generations of hardware are likely to continue to have an increasing number 
of processors, where the speed of each individual CPU core does not increase 
as rapidly as in past decades. Thus, exploiting parallelism is necessary to 
extract significant speedups from the hardware. 

Our work is primarily motivated by domain-independent planning. In 
classical planning, many problem instances continue to pose a challenge for 
state-of-the-art planning systems. Both the memory and the CPU require- 
ments are main causes of performance bottlenecks. The problem is especially 
pressing in sequential optimal planning. Despite significant progress in re- 
cent years in developing domain- independent admissible heuristics jH, 0, S] , 
scaling up optimal planning remains a challenge. Multi-processor, parallel 
planning^ has the potential to provide both the memory and the CPU re- 
sources required to solve challenging problem instances. 

We introduce and evaluate Hash Distributed A* (HDA*), a parallelization 
of A* jij. HDA* runs A* on every processor, where each processor has its 
own open and closed lists. A hash function assigns each state to a unique 
processor, so that every state has an "owner" . Whenever a state is generated, 
its owner processor is computed according to this hash function, and the state 
is sent to its owner. This simple mechanism simultaneously accomplishes 
load balancing as well as duplicate pruning. While the key idea of hash- 
based assignment of states to processors was initially proposed as part of the 
PRA* algorithm by Evett et al. j^], and later extended by Mahapatra and 
Dutt |^], the scalability and limitations of hash-based work distribution and 
duplicate pruning have not been previously evaluated in depth. In addition, 
hash-based work distribution has never been applied to domain-independent 
planning. 

HDA* has two key attributes which make it worth examining in detail. 



3 In this paper, parallel planning refers to computing sequential plans with multi- 
processor planning, as opposed to computing parallel plans with a serial algorithm. 
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First, it is inherently scalable on large-scale parallel systems, because it is a 
distributed algorithm with no central bottlenecks. While there has been some 
recent work in parallel search 0, 0, B] , these approaches are multi-threaded, 
and limited to a single, shared- memory machine. HDA*, on the other hand, 
scales naturally from a single, multicore desktop machine to a large scale, 
distributed memory cluster of multicore machines. When implemented using 
the standard MPI message passing library the exact same code can be 



executed on a wide array of parallel environments, ranging from a standard 
desktop multicore to a massive cluster with thousands of cores, effectively 
using all of the aggregate CPU and memory resources available on the system. 

Second, HDA* is a very simple algorithm - conceptually, the only differ- 
ence between HDA* and A* is that when a node is generated, we compute its 
hash value, and send the node to the closed list of the processor that "owns" 
the hash value. Everything runs asynchronously, and there is no tricky syn- 
chronization. This simplicity is extremely valuable in parallel algorithms, as 
parallel programming is notoriously difficult. Thus, the goal of our work is 
to evaluate the performance and the scalability of HDA*. 

HDA* is implemented as an extension of two state-of-the-art solvers. The 
first solver is the Fast Downward domain-independent planner. We use the 
cost-optimal version with the explicit (merge-and-shrink) abstraction heuris- 
tic reported by Helmert, Haslum, and Hoffman |3j. The second solver is 
an application-specific 24-puzzle solver, which uses the pattern database 



heuristic code provided by Korf and Felner [ll]. A key difference between 
these is the relative speed of processing an individual state. The domain- 
specific 24-puzzle solver processes states significantly faster than the domain- 
independent Fast Downward planner, expanding 2-5 times more states per 
second^ The speed of processing a state can significantly impact the effi- 
ciency of a parallel search algorithm, which is the speedup relative to a serial 
implementation divided by the number of CPU cores. A larger processing 
cost per state tends to diminish the impact of parallel-specific overheads, 
such as the communication and the synchronization overhead (introduced in 
Section [2]). Therefore, using both types of solvers allows us to assess the 
efficiency of HDA* across a range of application problems. 

The scaling behavior of the algorithm is evaluated on a wide range of par- 



4 Korf and Felner's code uses IDA*, which generates states much faster; we only incor- 
porate their pattern database heuristic code in our parallel A* framework. 
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allel machines. First, we show that on a standard, 8-core workstation, HDA* 
achieves speedups up to 6.6. We then evaluate the exact same HDA* code on 
a commodity cluster with an Ethernet network, and two high-performance 
computing (HPC) clusters with up to 2400 processors and 10.5TB of RAM. 
The scale of our experiments goes well beyond previous studies of hash-based 
work distribution. We show that HDA* scales well, allowing effective utiliza- 
tion of a large amount of distributed memory to optimally solve problems 
that require terabytes of RAM. 



The experiments include a comparison with TDS [12l . 113] . a successful 
parallelization of IDA* with a distributed transposition table. Using up to 
64 processors on a commodity cluster, we show that HDA* is 2-65 times 
faster and solves more instances than TDS. We propose a simple, hybrid 
approach that combines strengths of both HDA* and TDS: Run HDA* first, 
and if HDA* succeeds, return the solution. Otherwise, start a TDS search 
where the initial threshold for the depth-first exploration is provided by the 
failed HDA* search. Clearly, when HDA* succeeds, this hybrid algorithm 
runs as fast as HDA*. We show that, when HDA* fails, the total runtime of 
the hybrid approach is comparable to the running time of TDS. 

The rest of the paper is organized as follows. Section [2] presents back- 
ground information. The HDA* algorithm is described in Section [3j We 
then present an empirical evaluation and analysis of the scalability HDA* 
in Section SJ Tuning the performance of HDA* is addressed in Section 
Hash-based distribution is compared with a simpler, randomized work distri- 
bution strategy in Section [6j In Section [7J we compare HDA* with TDS, and 
propose a hybrid strategy which combines the strengths of both algorithms. 
We review other approaches to parallel search in Section [HJ This is followed 
by a summary and discussion of the results, and directions for future work. 



2. Background 

Efficient implementation of parallel search algorithms is challenging due 
to several types of overhead. Search overhead occurs when a parallel imple- 
mentation of a search algorithm expands (or generates) more states than a 
serial implementation. The main cause of search overhead is partitioning of 
the search space among processors, which has the side effect that the access 
to non-local information is restricted. For example, sequential A* can ter- 
minate immediately after a solution is found, because it is guaranteed to be 
optimal. In contrast, when a parallel A* algorithm finds a (first) solution at 
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some processor, it is not necessarily a globally optimal solution. Better solu- 
tions might exist in non-local portions of the search space. A more detailed 
discussion of the search overhead is available in Section 14.2.21 

Synchronization overhead is the idle time wasted at synchronization points, 
where some processors have to wait for the others to reach the synchroniza- 
tion point. For example, in a shared-memory environment, the idle time can 
be caused by mutual exclusion locks on shared data. Finally, communica- 
tion overhead refers to the cost of inter-process information exchange in a 
distributed-memory environment (i.e., the cost of sending a message from 
one processor to another over a network). 

The key to achieving a good speedup in parallel search is to minimize 
such overheads. This is often a difficult task, in part because the overheads 
are interdependent. For example, reducing search overhead usually increases 
synchronization and communication overhead. 

There are several, broad approaches to parallelizing search algorithms. 
This paper focuses on parallelization by partitioning the search space, so 
this section reviews this approach. Other approaches, including paralleliza- 
tion of the computation performed on each state, running a different search 
algorithm on each processor, and parallel window search, are reviewed in Sec- 
tion |8j In addition, this paper focuses on parallel best-first search. The use 
of hash-based work distribution techniques similar to HDA* for breadth-first 
search in model checking is reviewed in Section [SJ 

One general framework for partitioning the search space among paral- 
lel processes first starts with a root process which initially generates some 
seed nodes using sequential search, and assigns these seed nodes among the 
available processors. Then, at each processor, a search algorithm begins to 
explore the descendants of its seed node. These seed nodes become the lo- 
cal root nodes for a local, sequential search algorithm. This basic strategy 
can be applied to depth-first search algorithms, including simple depth-first 
search, branch- and-bound, and IDA*, as well as breadth-first and best-first 
search algorithms such as A*. 

Work stealing is a standard approach for partitioned, parallel search, and 
is used in many applications, particularly in shared-memory environments. 
In work-stealing, each processor maintains a local work queue. When a 
processor P generates new work (i.e., new states to be expanded) w, it places 
w in P's own local queue. When P has no work in its queue, it "steals" work 
from the queue of a busy processor. 

Two key considerations in a particular work-stealing strategy are how to 
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decide which processor to steal work from, as well as how to decide which 
and how much work to steal. Various work-stealing strategies for depth- 
first, linear-space algorithms such as depth- first branch- and-bound IDA*, 
and minimax search have been studied (e.g. 14, 15, 16[). While most work 



on work-stealing has been on MIMD systems, parallelization of IDA* on 
SIMD machines using an alternating, two-phase mechanism, with a search 
phase and a load balancing phase, has also been investigated fl~8l]Fl 

Another approach to search space partitioning (particularly in shared- 
memory search) is derived from a line of work on addressing memory capacity 
limitations by using a large amount of slower, external memory (such as 



disks), to store states in search [19|, |20|, |2l| (external memory was also used 



specifically for planning |22j). An issue with using external memory is the 
overhead of expensive I/O operations, so techniques for structuring the search 
to minimize these overheads have been the focus of work in this area. Korf 
has implemented a multithreaded, breadth-first search using a shared work 
queue which uses external memory 0, 0]. Interestingly, some approaches 
to reducing the I/O overhead in external memory search can be adapted to 
handle the inter-process communication overhead in parallel search. Zhou 
and Hansen ^ introduce a parallel, breadth-first search algorithm. Parallel 
structured duplicate detection seeks to reduce synchronization overhead. The 
original state space is partitioned into collections of states called blocks. The 
duplicate detection scope of a state contains the blocks that correspond to the 
successors of that state. States whose duplicate detection scopes are disjoint 
can be expanded with no need for synchronization. Burns et al. [i[ have 
investigated best-first search algorithms that include enhancements such as 
structured duplicate detection and speculative search. These techniques were 
shown to be effective in a shared memory machine with up to 8 cores. 

We now review the line of work directly related to HDA*. Algorithms 
such as breadth-first or best-first search (including A*) use an open list which 
stores the set of states that have been generated but not yet expanded. In an 
early study, Kumar, Ramesh, and Rao [23| identified two broad approaches 
to parallelizing best-first search, based on how the usage and maintenance 
of the open list was parallelized. In a centralized approach, a single open 



5 In MIMD (Multiple-Instruction, Multiple-Data stream) systems, each processor exe- 
cutes code independently; in SIMD (Single-Instruction, Multiple-Data stream) systems, 
all processors execute the same instruction (on different data). 



6 



list is shared among all processes. Each process expands one of the current 
best nodes from the globally shared open list, and generates and evaluates its 
children. This centralized approach introduces very little or no search over- 
head, and no load balancing among processors is necessary. Furthermore, 
this method is especially simple to implement in a shared-memory architec- 
ture by using a shared data structure for the open list. However, concurrent 
access to the shared open list becomes a bottleneck and inherently limits 
the scalability of the centralized approach, except in cases where the cost of 
processing each node (e.g., evaluating the node with a heuristic function) is 
extremely expensive, in which case overheads associated with shared open 
list access become insignificant. 

In contrast, in a decentralized approach to parallel best-first search, each 
process has its own open list. Initially, the root processor generates and 
distributes some search nodes among the available processes. Then, each 
process starts to locally run best-first search using its local open list (as well 
as a closed list, in case of algorithms such as A*). Decentralizing the open list 
eliminates the concurrency overheads associated with a shared, centralized 
open list, but load balancing becomes necessary. 



Kumar, Ramesh and Rao 23], as well as Karp and Zhang [24], |25] pro- 
posed a random work allocation strategy, where newly generated states were 
sent to random processors. In parallel architectures with non-uniform com- 
munication costs, a straightforward variant of this randomized strategy is 
to send states to a random neighboring processor (with low communication 
cost) to avoid the cost of sending to an arbitrary processor (c.f., [26]). 

In addition to load balancing, another issue that a parallel search al- 
gorithm must address is duplicate detection. In many search applications, 
including domain-independent planning, the search space is a graph rather 
than a tree, and there are multiple paths to the same state. In sequential 
search, duplicates can be detected and pruned by using a closed list (e.g., 



hash table) or other duplicate detection techniques (e.g. [27J, |28|). Efficient 
duplicate detection is critical for performance, both in serial and parallel 
search algorithms, and can potentially eliminate vast amounts of redundant 
work. 

In parallel search, duplicate state detection incurs several overheads, de- 
pending on the algorithm and the machine environment. For instance, in 
a shared- memory environment, many approaches, including work-stealing, 
need to carefully manage locks on the shared open and closed lists. 

Parallel Retracting A* (PRA*) |5] simultaneously addresses the problem 
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of work distribution and duplicate state detection. In PRA*, each processor 
maintains its own open and closed lists. A hash function maps each state 
to exactly one processor which "owns" the state. When generating a state, 
PRA* distributes it to the corresponding owner. If the hash keys are dis- 
tributed uniformly across the processors, load balancing is achieved. After 
receiving states, PRA* has the advantage that duplicate detection can be 
performed efficiently and locally at the destination processor. 

While PRA* incorporated the idea of hash-based work distribution, PRA* 
differs significantly from a parallel A* in that it is a parallel version of RA* 
5fl, a limited memory search algorithm closely related to MA* [29[ and SMA* 
301 ]. When a processor's memory became full, Parallel Retracting A* retracts 
states from the search frontier, and their /-values are stored in their parents, 
which frees up memory. Thus, unlike parallel A*, PRA* does not store all 
expanded nodes in memory, and will not terminate due to running out of 
memory in some process. On the other hand, the implementation of this re- 
traction mechanism in [5} incurs a significant synchronization overhead: when 
a processor P generates a new state s and sends it to the destination pro- 
cessor Q, P blocks and waits for Q to confirm that s has successfully been 
received and stored (or whether the send operation failed due to memory 
exhaustion at the target process). Unlike PRA*, HDA* does not incorpo- 
rate a node retraction mechanism. Also, unlike the original implementation 
of PRA*, HDA* is a fully asynchronous algorithm, where all messages are 
sent / received asynchronously. 

The idea of hash-based work distribution was investigated further by Ma- 
hapatra and Dutt jij], who studied parallel A* on a Hypercube architecture, 
where CPUs are connected by a hypercube network, while in current stan- 
dard architectures machines are connected by either a mesh or torus network. 
As a baseline, they implemented Global Hashing (GOHA), which is similar 
to PRA*, except that GOHA is a parallelization of SEQ_A*, a variant of A* 
which performs partial expansion of states, while HDA* is a parallelization of 
standard A*. They proposed two alternatives to the simple hash-based work 
distribution in PRA*. The first approach by Mahapatra and Dutt, called 
Global Hashing of Nodes and Quality Equalizing (GOHA&QE), decouples 
load balancing and duplicate checking. States are assigned an owner (based 
on hash key) for duplicate checking, and a newly generated state is first sent 
to its owner process for duplicate checking. If the state is in the open list of 
the owner process, then it is a duplicate and discarded. Otherwise, it is added 
to the open list of the owner, and the state is possibly reassigned to another 



8 



process using Dutt and Mahapatra's Quality Equalizing (QE) strategy [26 



In both PRA* and GOHA&QE, the hash function is global - a state can 
be hashed to any of the processors in the system. In parallel architectures 
where communication costs vary among pairs of processors, such a global 
hashing may be suboptimal. Therefore, Mahapatra and Dutt also proposed 
Local Hashing of Nodes and QE (LOHA&QE), which incorporates a state 
space partitioning strategy and allocates disjoint partitions to disjoint pro- 
cessor groups in order to minimize communication costs [6] . Mahapatra and 
Dutt showed that GOHA&QA and LOHA&QE outperformed the simpler, 
global hash-based work distribution method used in PRA* on the Travelling 
Salesperson Problem (TSP). 

Mahapatra and Dutt's local hashing is based on a number of restrictive 
assumptions that do not allow applying this strategy to planning. Specifi- 
cally, local hashing works in problems with so-called levelized search graphs. 
In a levelized graph, a given state will always have the same depth (distance 
from root node), regardless of the path that connects the root and the state. 
The notion of the levelized graph can sometimes be extended to exploit local 
hashing if the search space has some regularities on depths such as multiple 



sequence alignment [31] . However, planning and the sliding-tile puzzle do 
not belong to this class of problems. The reason is that the same state could 
be reached via different-length paths. For example, if two cities A and B are 
connected by two routes of different lengths, then driving a truck from A to 
B via each route will result in the same state but the paths from the starting 
state will have different lengths. 



Transposition-table driven work scheduling (TDS) [12|, [13J is a distributed 
memory, parallel IDA* algorithm. Similarly to PRA*, TDS distributes work 
using a state hash function. The transposition table is partitioned over pro- 
cessors to be used for detecting and pruning duplicate states that arrive at 
the processor. Thus, TDS distributes a transposition table for IDA* among 
the processing nodes, similarly to how PRA* distributes the open and the 
closed lists for A*. This distributed transposition table allows TDS to exhibit 
a very low (sometimes negative) search overhead, compared to a sequential 
IDA* that runs on a single computational node with limited RAM capac- 
ity. TDS achieved impressive speedups in applications such as the 15-puzzle, 
the double-blank puzzle, and the Rubik's cube, on a distributed-memory 
machine. The ideas behind TDS have also been successfully integrated in 
adversarial two-player search [32|, |33|, [34 . 

Thus, the idea of hash-based distribution of work is not new, but there 
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are several reasons to revisit the idea and perform an in-depth evaluation 
at this point. First, the primary motivation for this work was to advance 
the state of the art of domain-independent planning by parallelizing search. 
While there has been some previous, smaller-scale work on parallel planning, 
large-scale parallel planning has not been previously attempted, and hash- 
based work distribution is a natural approach for scaling parallel planning to 
large-scale parallel clusters. 

Second, parallel systems have become much more common today than 
when the earlier work by Evett et al. [5!] and Mahapatra and Dutt p was 
done, and the parallel systems which are prevalent today have very different 
architectures. The most common parallel architectures today are commodity, 
multicore, shared memory machines, as well as distributed memory clusters 
which are composed of shared memory multicore nodes. Hash-based work 
distribution is a simple approach that can potentially scale naturally from 
single, multicore nodes to large clusters of multicore nodes, and it is impor- 
tant to evaluate its performance on current, standard parallel architectures. 
In addition, previous algorithms based on this idea, such as PRA* and Ma- 
hapatra and Dutt's method, make some assumptions that are specific to the 
hardware architecture in use. In contrast, we aim at obtaining an algorithm 
as general as possible, avoiding hardware-specific assumptions. In fact, as 
mentioned earlier, our MPI-based implementation can be run on a variety of 
platforms, including shared-memory and distributed-memory systems. 

Third, there are some important issues that were not fully explored in the 
earlier work. For example, factors which can potentially limit the scaling of 
hash-based work distribution, such as search overhead and communication 
overhead, have not been analyzed in detail. Also, the performance impact of 
asynchronous vs synchronous communication, as well as the impact of using 
a hash function for work distribution, as opposed to a randomized strategy, 
have not been previously investigated. 

Fourth, while the work on TDS showed the utility of asynchronous, hash- 
based work distribution for IDA*, this previous work was done on 15-puzzle 
variants and the Rubik's cube, which are two domains where the overhead 
incurred by re-exploration of states in IDA* is known to be relatively small. 
In some other domains, this overhead can be quite significant, which can 
result in significant costs on a cluster environment. Thus, an investigation 
of the scalability of hash-based, parallel A* and a comparison with TDS is 
worthwhile. 

Thus, while previous work has considered global hash-based distribution 
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either as a component of a more complex algorithm [5] or as a straw man 
against which local hash-based distribution was considered jij], this is the 
first paper which analyzes the scalability and limitations of global hashing 
in depth. An early version of this work has been previously presented in a 
conference paper 35(. However, that initial work was limited to up to 128 
CPU cores on an older system, contained no results on the 24-puzzle, no 
comparison to TDS, and included a less detailed evaluation and analysis. 



3. Hash Distributed A* 

We now describe Hash Distributed A* (HDA*), a simple parallelization 
of A* which uses the hash-based work distribution strategy originally pro- 
posed in PRA* [5]. In HDA* the closed and open lists are implemented as 
a distributed data structure, where each processor "owns" a partition of the 
entire search space. The local open and closed list for processor P is denoted 
Openp and Closedp. The partitioning is done via a hash function on the 
state, as described later. 

HDA* starts by expanding the initial state at the root processor. Then, 
each processor P executes the following loop until an optimal solution is 
found: 

1. First, P checks if one or more new states have been received in its 
message queue. If so, P checks for each new state s in Closedp, in 
order to determine whether s is a duplicate, or whether it should be 
inserted in OpenJ§. 

2. If the message queue is empty, then P selects a highest priority state 
from Openp and expands it, resulting in newly generated states. For 
each newly generated state s, a hash key K{s) is computed based on 
the state representation, and s is sent to the processor which owns 
K(s). This send is asynchronous and non-blocking. P continues its 
computation without waiting for a reply from the destination. 

In a straightforward implementation of hash-based work distribution on 
a shared memory machine, each thread owns a local open/closed list imple- 
mented in shared memory, and when a state s is assigned to some thread, 



6 Even if the heuristic function Q is consistent, parallel A* search may sometimes have 
to re-open a state saved in the closed list. For example, P may receive many identical 
states with various priorities from different processors and these states may reach P in 
any order. 
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the writer thread obtains a lock on the target shared memory, writes s, then 
releases the lock. Note that whenever a thread P "sends" a state s to a des- 
tination dest(s), then P must wait until the lock for shared open list (or mes- 
sage queue) for dest(s) is available and not locked by any other thread. This 
results in significant synchronization overhead - for example, it was observed 
in [i[ that a straightforward implementation of PRA* exhibited extremely 
poor performance on the Grid search problem, and multicore performance for 
up to 8 cores was consistently slower than sequential A*. While it is possible 
to speed up locking operations by using, for example, highly optimized lock 
operations implementations in inline assembly language, the performance 
degradation due to synchronization remains a considerable problem. 

In contrast, the open/closed lists in HDA* are not explicitly shared among 
the processors. Thus, even in a multicore environment where it is possible to 
share memory, all communications are done between separate MPI processes 
using non-blocking send/receive operations. Our program implements this 
by using MPI_Bsend and MPI_Iprobe, and relies on highly optimized message 
buffers implemented in MPI. 

Every state must be sent from the processor where it is generated to its 
"owner" processor. In their work with transposition-table driven scheduling 



for parallel IDA*, Romein et al. [12[ showed that this communication over- 
head could be overcome by packing multiple states with the same destination 
into a single message. HDA* uses this state packing strategy to reduce the 
number of messages. The relationship between performance and message 
sizes depends on several factors such as network configurations, the number 
of CPU cores, and CPU speed. In our experiments, 100 states are packed 
into each message on a commodity cluster using more than 16 CPU cores and 
a HPC cluster, while 10 states are packed on the commodity cluster using 
less than 16 cores. 

In a decentralized parallel A* (including HDA*), when a solution is dis- 



covered, there is no guarantee at that time that the solution is optimal [23 



When a processor discovers a locally optimal solution, the processor broad- 
casts its cost. The search cannot terminate until all processors have proved 
that there is no solution with a better cost. In order to correctly terminate 
a decentralized parallel A*, it is not sufficient to check the local open list at 
every processor. We must also ensure that there is no message en route to 
some processor that could lead to a better solution. Various algorithms to 
handle termination exist. In our implementation of HDA*, we used the time 



algorithm of Mattern [36|, which was also used in TDS. 
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Mattern's method is based on counting sent messages and received mes- 
sages. If all processors were able to count simultaneously, it would be trivial 
to detect whether a message is still en route. However, in reality, different 
processors Pi will report their sent and received counters, S(ti) and R{ti), at 
different times £j. To handle this, Mattern introduces a basic method where 
the counters are reported in two different waves. Let R* = Y^i R{U) be the 
accumulated received counter at the end of the first wave, and S'* = J2i S(t[) 
be the accumulated sent counter at the end of the second wave. Mattern 
proved that if S'* = R*, then the termination condition holds (i.e., there are 
no messages en route that can lead to a better solution). 

Mattern's time algorithm is a variation of this basic method which allows 
checking the termination condition in only one wave. Each work message 
(i.e., containing search states to be processed) has a time stamp, which can be 
implemented as a clock counter maintained locally by each processor. Every 
time a new termination check is started, the initiating processor increments 
its clock counter and sends a control message to another processor, starting a 
chain of control messages that will visit all processors and return to the first 
one. When receiving a control message, a processor updates its clock counter 
C to max(C, T), where T is the maximum clock value among processors 
visited so far. If a processor contains a received message m with a time 
stamp t m > T, then the termination check fails. Obviously, if, at the end 
of the chain of messages, the accumulated sent and received counters differ, 
then the termination check fails as well. 

In hash based work distribution, the choice of the hash function is essen- 
tial for achieving uniform distribution of the keys, which results in effective 
load balancing. Our implementation of HDA* uses the Zobrist function 37 
to map a SAS+ state representation 38j to a hash key. The Zobrist function 
is commonly used in the game tree search community to detect duplicate 
states. The Zobrist hash value is computed by XOR'ing predefined random 
numbers associated with the components of a state. When a new state is 
generated, the hash value of the new state can be computed incrementally 
from the hash value of the parent state by incrementally XOR'ing the state 
components that differ. The Zobrist function was previously used in domain- 



independent planning in MacroFF [39(. It is possible for two different states 



to have the same hash key, although the probability of such a collision is 
extremely low with 64-bit keys. In MacroFF, as well as an earlier version of 



HDA* 35|, duplicate checking in the open/list was performed by checking if 



the hash key of a state was present in the open/closed list, so there was a 



13 



non-zero (albeit tiny) probability of a false positive duplicate check result. In 
this paper, our HDA* implementations perform duplicate checks by compar- 
ing the actual states. Although this is slightly slower than comparing only 
the hash key, duplicate checks are guaranteed to be correct. 

4. Scalability of HDA* 

We experimentally evaluated HDA* on top of a domain-independent 
planner and an application-specific, 24-puzzle solver. Our hardware envi- 
ronments, including a single, multicore machine (Mult i core), a commodity 
cluster (Commodity), and two high-performance clusters (HPC clusters) with 
up to 2400 processors (HPC1, HPC2), are shown in Tabled] In all of our exper- 
iments, HDA* is implemented in C++, compiled with g++ and parallelized 
using the MPI message passing library. 



System 


node description 


# cores 


RAM 


Interconnection 


Name 




per node 


per node 


between nodes 


Multicore 


2.33GHz 2x 4-core 
Xeon L5410 


8 


32GB 


n/a 


Commodity 


2.33GHz 2x 4-core 
Xeon L5410 


8 


16GB 


lGbps(x2) 
Ethernet 


HPC1 


2.4GHz 8x 2-core 
AMD Opteron 


16 


32GB 


20Gb Infiniband 


HPC2 


2.93GHz 2x 6-core 
Xeon X5670 


12 


54GB 


QDR 

Infinibandx2 
(80 Gbps) 



Table 1: Parallel machines used in experiments 



We first describe experimental results for domain-independent planning. 
We parallelized the sequential optimal version of the Fast Downward plan- 
ner, enhanced with the so-called LFPA heuristic, which is based on explicit 
(merge-and-shrink) state abstraction |. All the reported results are ob- 
tained with the abstraction size set to 1,000. Preliminary experiments with 
the abstraction size set to 5,000 did not change the results qualitatively. As 
benchmark problems, we use classical planning instances from past plan- 
ning competitions. We selected instances that are hard or unsolvable for the 
sequential optimal version of Fast Downward. 
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HDA*, like other asynchronous parallel search algorithms, behaves non- 
deterministically, resulting in some differences in search behavior between 
identical invocations of the algorithms. However, on the runs where we col- 
lected multiple data points, we did not observe significant differences between 
runs of HDA*. Therefore, due to the enormous resource requirements of a 
large-scale experimental studyfl the results shown are for single runs. 

4-1- Asynchronous vs Synchronous Communications: Experiments on a Sin- 
gle, Multicore Machine 
First, we evaluate HDA* on a single, multicore machine (presented in 
Table [1]) in order to investigate the impact of asynchronous vs synchronous 
communications in parallel A*. We compare HDA* with sequential A* and 
a shared- memory implementation of Parallel Retracting A* (PRA*) [5j on a 
single, multicore machine. As described in Section &\ PRA* uses the same 
hash-based work distribution strategy as HDA*, but uses synchronous com- 
munications. As in Burns et al.'s experiments [9J, our PRA* implementation 
does not include the node retraction scheme because the main goal of our 
experiments is to show the impact of eliminating synchronization overhead 
from PRA*. 

Our HDA* implementation is the same MPI-based implementation used 
in our larger-scale, distributed memory experiments described below. It exe- 
cutes a separate OS process for each thread of execution, and we rely on the 
message passing functions in MPI for asynchronous communications. While 
it may be possible to develop a more efficient implementation of HDA* specif- 
ically for shared memory machines (e.g., using kernel threads and shared 
memory constructs), we wanted to investigate the scalability of the same 
HDA* implementation on both shared and distributed memory machines. 

Low level optimizations were implemented to make both the HDA* and 
PRA* as fast as possible. In addition to locks available in the Boost C++ 
library, we also incorporated spin locks based on the "xchgl" assembly oper- 
ation in order to speed up PRA*. 

Each algorithm used the full 32 gigabytes of RAM available on the ma- 
chine at hand. That is, n-core HDA* spawns n processes, each using 32 jn 
gigabytes of RAM, sequential A* used the full 32GB available, and the mul- 
tithreaded PRA* algorithm shared 32GB of RAM among all of the threads. 



7 In addition to usage charges for the clusters, there are issues of resource contention 
because the clusters are shared among hundreds of users. 
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1 core 


4 cores 


8 cores 






A* 


HDA* 


PRA* 


HDA* 


PRA* 


Abst 


Opt 




time 


spd 


eff 


spd 


eff 


spd 


eff 


spd 


eff 


time 


len 


Average 


971.72 


2.71 


0.68 


2.56 


0.64 


4.98 


0.62 


4.45 


0.56 


2.76 




DcpotlO 


99.82 


2.51 


0.63 


2.45 


0.61 


4.61 


0.58 


4.11 


0.51 


1.96 


24 


Depotl3 


1561.83 


2.66 


0.67 


2.57 


0.64 


4.78 


0.60 


4.44 


0.56 


4.16 


25 


Drivcrlog8 


102.55 


2.59 


0.65 


2.14 


0.54 


4.90 


0.61 


3.87 


0.48 


0.14 


22 


Freecell5 


137.01 


2.99 


0.75 


2.99 


0.75 


5.87 


0.73 


5.83 


0.73 


7.75 


30 


Frccccll7 


2261.67 


3.12 


0.78 


3.06 


0.77 


5.88 


0.74 


5.85 


0.73 


9.60 


41 


Roverl2 


923.23 


2.73 


0.68 


2.46 


0.61 


5.03 


0.63 


4.43 


0.55 


0.10 


19 


Satellite6 


104.83 


2.25 


0.56 


2.06 


0.51 


4.30 


0.54 


3.61 


0.45 


0.09 


20 


ZenoTrav9 


157.98 


2.57 


0.64 


2.29 


0.57 


4.61 


0.58 


3.66 


0.46 


0.19 


21 


ZenoTravll 


424.68 


2.42 


0.60 


2.10 


0.52 


4.40 


0.55 


3.60 


0.45 


0.22 


14 


PipesNoTkl4 


248.77 


3.91 


0.98 


3.65 


0.91 


5.50 


0.69 


4.74 


0.59 


1.68 


30 


PipesNoTk24 


1046.94 


2.97 


0.74 


2.72 


0.68 


5.37 


0.67 


4.82 


0.60 


5.99 


24 


Pcgsol27 


178.71 


2.97 


0.74 


2.89 


0.72 


5.87 


0.73 


5.09 


0.64 


1.11 


28 


Pegsol28 


773.36 


2.98 


0.75 


2.94 


0.73 


5.87 


0.73 


4.92 


0.61 


0.65 


35 


Airportl7 


322.21 


3.54 


0.89 


3.52 


0.88 


6.62 


0.83 


6.77 


0.85 


13.09 


88 


Gripper8 


304.82 


2.57 


0.64 


2.29 


0.57 


4.41 


0.55 


3.63 


0.45 


0.28 


53 


Gripper9 


1710.39 


2.63 


0.66 


2.31 


0.58 


4.51 


0.56 


4.05 


0.51 


0.36 


59 


Mystery6 


315.21 


3.08 


0.77 


3.01 


0.75 


5.64 


0.70 


5.39 


0.67 


17.51 


11 


Truck5 


365.38 


2.17 


0.54 


2.32 


0.58 


4.23 


0.53 


4.00 


0.50 


0.26 


25 


Truck6 


3597.24 


2.41 


0.60 


2.23 


0.56 


4.53 


0.57 


4.11 


0.51 


0.30 


30 


Truck8 


2194.38 


2.39 


0.60 


2.23 


0.56 


4.35 


0.54 


3.82 


0.48 


0.22 


25 


Sokobanl9 


157.82 


2.90 


0.72 


2.57 


0.64 


5.61 


0.70 


4.76 


0.60 


1.59 


164 


Sokoban22 


428.91 


3.07 


0.77 


3.03 


0.76 


5.99 


0.75 


5.27 


0.66 


1.53 


172 


BlockslO-2 


327.03 


2.83 


0.71 


2.39 


0.60 


5.51 


0.69 


4.24 


0.53 


1.04 


34 


Logist00-7-l 


1235.26 


2.40 


0.60 


2.16 


0.54 


4.43 


0.55 


3.75 


0.47 


0.07 


44 


Logist00-9-l 


2082.76 


2.46 


0.61 


2.45 


0.61 


4.44 


0.55 


4.32 


0.54 


0.13 


30 


Miconicl2-2 


2308.03 


2.11 


0.53 


2.19 


0.55 


3.76 


0.47 


3.46 


0.43 


0.08 


40 


Miconicl2-4 


2463.13 


2.15 


0.54 


2.18 


0.54 


3.87 


0.48 


3.55 


0.44 


0.08 


41 


Mprime30 


1374.27 


2.52 


0.63 


2.49 


0.62 


4.58 


0.57 


4.64 


0.58 


7.18 


9 



Table 2: Comparison of sequential A*, HDA* and PRA* (without node retraction) 
on 1, 4, and 8 cores on the Multicore machine. Runtimes (in seconds), speedup 
(spd), efficiency (eff), abstraction heuristic initialization times (not included in 
runtimes in previous columns), and optimal plan length are shown. 



Table [2] shows the speedup of HDA* and PRA* for 4 cores and 8 cores. 
We show only instances that can be solved by the serial planner with 32GB 
RAM available. In addition to runtimes for the sequential A* algorithm, 
the speedup and the parallel efficiency are shown for HDA* and PRA*. 
Efficiency is defined as S/P, where S is the speedup over a serial run and 
P is the number of cores. As shown in Table [2j HDA* clearly outperforms 
PRA*. With 4 cores, the speedup of HDA* ranges from 2.11 to 3.91, and the 
efficiency ranges from 0.53 to 0.98. With 8 cores, the speedup of HDA* ranges 
from 3.76 to 6.62, and the efficiency ranges from 0.47 to 0.83. These results 
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demonstrate the benefit of asynchronous message passing over a synchronous 
implementation of hash-based work distribution. 

4-2. Planning Experiments on a HPC Cluster 

Next, we investigate the scaling behavior of HDA* on the HPC2 cluster, 
(see machine specs in Tabled]). We used 1-200 nodes in our experiments (i.e., 
1-2400 cores). 

As in Table [2j the times shown in Table [3] include the time for the search 
algorithm execution, and exclude the time required to compute the abstrac- 
tion table for the LFPA heuristic, since this phase of Fast Downward+LFPA 
has not been parallelized yet and therefore requires the same amount of time 
to run regardless of the number of cores. For example, the IPC-6 Pegsol-30 
instance, which requires 168.18 seconds with 60 cores, was solved in 20.54 
seconds with 1200 cores, plus 1.30 seconds for the abstraction table gener- 
ation. A value of "-" for the runtimes in Table [3] indicates a failure, i.e., 
the planner terminated because one of the nodes ran out of memory. For 
example, the Freecell-12 instance was first solved using 600 cores. 

Let tk be the runtime to solve a problem using k processors. Standard 
metrics for evaluating parallel performance on n processors include speedup 
and efficiency, defined as as S n = ti/t n , and E n = S n /n, respectively. These 
standard metrics have limited applicability for parallel A* because sequen- 
tial runtimes cannot be measured for hard problems. On hard problems that 
truly require large-scale parallel A*, sequential A* exhausts RAM and termi- 
nates before finding a solution. Since sequential runtime t\ cannot be mea- 
sured, S n and E n cannot be computed. While we could use only benchmarks 
which can be solved using a single processor (as we did for our multicore 
experiment in Section r4.ll) . this would restrict the benchmark set to prob- 
lems that can be solved by A* using only the RAM available on 1 processing 
node. Suppose there is 4.5GB RAM per core, as is the case with our HPC2 
cluster. A single-threaded A* algorithm that generates 50,000 new states per 
second, using 50 bytes per state, will exhaust 4.5GB within 30 minutes. In 
fact, because the state size for difficult planning instances is usually much 
larger than 50 bytes, serial A* exhausts 4.5GB memory much more quickly. 
Problems that require only a few minutes to solve using sequential search are 
poor benchmarks for large-scale parallel search, because such problems can 
be solved in a few seconds using 1000 processors. 

Mahapatra and Dutt also noted that sequential runtimes were not avail- 
able for their scalability experiments jof . They evaluated their algorithms by 
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12 


24 


■ 


144 


300 


600 


1200 


2400 


Abst 


Opt 


Freeccll7 


175.87 


95.27 


38.31 


18.11 


10.69 


10.96 


18.11 


68.40 


4.73 


41 


Logistics00-7-l 


115.71 


56.41 


23.88 


10.78 


6.50 


6.80 


18.76 


48.34 


0.09 


44 


Logistics00-9-l 


175.72 


91.48 


38.43 


17.13 


9.34 


7.73 


18.64 


61.43 


0.13 


30 


Mprime30 


123.85 


71.82 


32.07 


14.92 


8.11 


6.55 


8.03 


28.38 


4.10 


9 


Pogsol.p28 


52.89 


31.59 


12.90 


6.38 


4.41 


14.73 


21.90 


57.18 


0.52 


35 


PipeNoTank24 


89.33 


49.88 


19.99 


9.05 


5.55 


5.69 


19.42 


65.30 


3.07 


24 


Probfcell-4-1 


112.07 


60.52 


25.64 


11.82 


6.75 


5.90 


9.18 


31.57 


1.62 


19 


Rovcr6 


445.17 


249.76 


101.71 


44.46 


23.27 


14.19 


11.57 


18.76 


0.12 


36 


Rover 12 


42.42 


26.36 


10.67 


5.10 


3.61 


4.14 


8.07 


26.03 


0.12 


19 


Sokoban.p24 


136.43 


79.95 


34.00 


17.64 


26.79 


27.43 


51.88 


- 


0.69 


205 


Sokoban.p28 


171.97 


93.00 


39.95 


19.47 


25.13 


25.75 


49.70 


- 


0.62 


135 


ZcnoTravclll 


38.49 


21.84 


8.28 


4.33 


3.15 


4.77 


13.61 


28.61 


0.20 


14 


Freecell6 


_ 


386.55 


163.18 


70.85 


37.14 


22.03 


20.85 


52.59 


4.36 


34 


Pegsol.p29 


- 


146.08 


61.49 


27.78 


15.08 


11.18 


14.11 


52.84 


6.45 


37 


PipesTanklO 


- 


689.17 


292.89 


126.54 


62.68 


33.81 


22.01 


33.78 


4.25 


19 


Satellite? 


- 


1339.41 


415.04 


147.22 


63.39 


33.40 


24.50 


37.55 


0.19 


21 


Sokoban.p25 


- 


108.73 


44.00 


22.23 


19.46 


26.40 


47.44 


- 


1.06 


155 


DrivcrLogl3 




- 


163.65 


66.60 


33.82 


20.31 


18.54 


130.18 


0.25 


26 


Logistics00-8-l 




- 


222.63 


95.03 


91.91 


26.18 


27.19 


46.58 


0.09 


44 


Logistics00-9-0 




- 


210.95 


94.24 


45.07 


25.67 


22.48 


65.00 


0.13 


36 


Mprime24 




- 


60.60 


33.53 


16.97 


10.72 


10.78 


23.86 


25.31 


8 


Pegsol.p30 




- 


168.18 


75.79 


39.51 


22.39 


20.54 


47.44 


1.30 


48 


PipeNoTanklS 




- 


131.98 


58.09 


29.46 


18.77 


15.93 


141.66 


2.28 


30 


PipcsTank9 




- 


418.51 


180.63 


89.31 


47.85 


29.22 


34.26 


5.43 


18 


Probfcell-5-3 


- 


- 


733.73 


315.30 


156.25 


81.34 


47.80 


47.86 


2.48 


24 


Sokoban.p26 


- 


- 


95.44 


47.36 


33.79 


34.26 


70.25 


- 


1.04 


135 


Sokoban.p27 


- 


- 


167.96 


79.19 


48.15 


33.52 


37.98 


- 


1.54 


87 


Depotl6 


- 


- 


- 


461.64 


226.00 


116.86 


64.08 


58.78 


3.97 


25 


Freecellll 




- 




304.71 


151.55 


81.74 


54.63 


99.66 


6.10 


56 


Mprimel5 




- 




128.58 


61.32 


32.74 


21.52 


26.42 


21.04 


6 


PipeNoTank20 








109.74 


51.77 


34.32 


20.59 


120.82 


3.54 


28 


PipeNoTank32 








121.60 


61.85 


32.90 


22.81 


92.47 


3.68 


30 


PipesTankl4 








126.94 


62.53 


34.69 


23.94 


37.51 


6.00 


38 


PipesTank22 








183.47 


90.66 


50.08 


31.02 


50.57 


8.30 


30 


Probfcell-5-1 








659.28 


323.29 


165.22 


89.94 


62.69 


3.10 


24 


Probfcell-5-2 








375.00 


185.41 


95.11 


56.24 


89.90 


2.80 


23 


Probfccll-5-4 








352.93 


174.56 


90.15 


51.50 


46.54 


3.76 


23 


Probfccll-5-5 








546.76 


268.79 


137.54 


76.12 


59.21 


2.36 


25 


ZcnoTravell2 








291.60 


138.88 


72.40 


48.09 


57.17 


0.24 


21 


Freecell9 










334.06 


172.59 


96.56 


93.50 


6.49 


43 


PipeNoTank33 










185.34 


97.02 


54.70 


55.72 


5.04 


32 


PipeNoTank35 










205.55 


105.50 


58.83 


53.94 


7.63 


22 


Frcccclll2 












268.28 


148.74 


117.21 


5.32 


47 


PipeNoTank25 












266.94 


142.34 


101.08 


3.57 


32 


PipeNoTank27 












319.07 


170.78 


103.62 


5.14 


26 



Table 3: Planning runtimes on the HPC2 cluster (see Table Q] for machine specs). 



computing approximate speedups, denned as follows. Let p m i n be the mini- 
mum number of processors that solved the problem. They make the assump- 
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tion that the speedup with p min processors is p min (i.e., they assume that 
linear speedup can be obtained up to p min processors). Based on this, they 
estimate that sequential runtime is p m in x t Pmm > and speedups for p > p min 
processors can be computed by computing the ratio of parallel runtime to 
this estimate of sequential runtime. 

Mahapatra and Dutt argue that this is a conservative estimate of speedup 
because overheads increase as the number of processors increases, so the 
assumption that S p = p m %n is a lower bound on the the actual speedup for p > 
Pmin- However, there are some issues with this approach: (1) such estimates 
of "speedup" are not actual speedup values, (2) p min can be different for 
different problem instances, and (3) when p min is large, e.g., if a problem is 
first solved using 600 cores, including these estimated speedups significantly 
biases computations of average speedup to seem artificially high - in other 
words, assuming that linear speedups are obtainable up to p m in processors is 
unrealistic when p min is large. 

Thus, we take a different approach to measuring the scalability of parallel 
search, based on the performance of n processors relative to the performance 
on Pmi n . Let p min be the smallest number of cores that solves a problem, and 
tmin be the wall-clock runtime for p m in cores. The relative speedup is defined 
as S n = tmin/tn, and the relative efficiency is defined as E n = S n /(n/p min ). 
Relative efficiency has been used by other researchers, c.f.. Nicwiadomski et 
al. 4o| (who called it "speedup efficiency"). 

A second issue related to measuring performance on a cluster of multi- 
core machines is the baseline configuration of a single processing node. As 
shown later in Section I4.2.5[ the number of cores used per processing node 
has a significant effect on performance. Our main goal is to investigate the 
scalability of HDA* while fully utilizing all available processors, so in our 
scalability experiments, we use all cores on a processor, e.g., on the HPC2 
cluster, we use the 12 cores per core, allocating 4.5GB RAM per core. 

Table [3] shows the wall-clock runtimes of HDA* on a set of 45 IPC bench- 
mark problems. The results are grouped according to the smallest number 
of processors that solved the problem. As mentioned earlier, a "-" indicates 
a failure, i.e., the planner terminated because one of the cores ran out of 
memory. For example, the Depotl6 instance was first solved using 144 cores. 
Sokoban.p24-Sokoban.p27 failed with 2400 cores, even though they could be 
solved with fewer cores. 

Figure [1] shows that the relative efficiency (see above) of HDA* on plan- 
ning generally scales well for 2p m i n and 4p min cores, even when p min = 600 
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Planning 



- HPC Cluster 2 Relative Efficiency 




Figure 1: Planning relative efficiency on the HPC2 cluster (see Table Q] for machine specs). 

cores. Even for 8p m i n cores, the relative efficiency is over 0.5 for p min G 
{12,24,60,144,300}. However, as p/p m i n increases, the relative efficiency 
degrades. For p min = 12, when p = 1200 (p/p m in = 100), relative efficiency 
is 9.25%, and for p = 2400 cores, the relative efficiency is a mere 2.2%. 

In other words, HDA* scales reasonably efficiently for up to 4-8 times 
Pmin, the minimum number of processors that can solve the problem. How- 
ever, as the number of processors is further increased, there is a point of 
diminishing returns, and eventually, adding more processors can result in 
longer runtimes. An extreme case of this scaling limitation can be seen on 
the Sokoban benchmark problems (Sokoban.p24-p.28), which were solvable 
with up to 1200 cores (with diminishing returns), but terminated due to 
memory exhaustion at some processor on 2400 cores. On the other hand, it 
is important to note that even for very large values of p m in (e.g., 600 cores), 
HDA* continues to scale very efficiently for 4p m j n (2400) cores. 

4-2.1. Load Balance 

A common metric for measuring how evenly the work is distributed among 
the cores is the load balance, defined as the ratio of the maximal number of 
states expanded by a core and the average number of states expanded by 
each core. As shown in Figure [21 HDA* achieves good load balance when 
p/Pmin < 8. While load balance tends to degrade as the number of processors 
increases, it is important to note that load imbalance does not seem to be 
simply caused by using a large, absolute number of processors. For instance, 
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Figure 2: Planning load balance on the HPC2 cluster. 

on all 6 problems where p m in > 600, the load balance for 2400 cores is less 
than 1.10. 

One possible reason for load imbalance may be "hotspots" - frequently 
generated duplicate nodes mapped to a small number of cores by the hash 
function. This is caused by transpositions in the search space, which are 
states that can be reached through different paths. In HDA*, if a processor 
receives a state s which is already in the closed list but the g-value of s is 
smaller than that in the closed list, s must be enqueued in the open list. 
However, the heuristic value of s is not recomputed in saving s to the open 
list, because the value is already saved in the closed list. For example, in 
solving PipesNoTk24 with 2400 cores, more than 70% of generated states 
are duplicates. A processor involved in a hotspot receives about 377% more 
duplicate states than the processor receiving duplicate states least frequently, 
although we observe that they receive similar amounts of work. As a result, 
the numbers of calls for the heuristic function are different (about 144%) 
between these processors. Thus, a processor which executes fewer heuristic 
evaluations (relative to other processors receiving a comparable number of 
states) has a higher state expansion rate than the other processors, resulting 
in load imbalance. 
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4-2.2. Search Overhead 

The search overhead, which indicates the extra states explored by parallel 
search, is defined as: 

. number of states expanded by parallel search 

SO = 100 x — — - 1 . 

number oi states expanded by baseline search 

This is the percentage of extra node expansions performed compared to a 
given baseline algorithm or hardware configuration. For a direct measure- 
ment of the search overhead for some given instance our baseline algorithm 
(configuration) will be HDA* on p min e {12, 24, 60, 144, 300, 600, 1200, 2400} 
cores, the minimal number of cores that solves that instance. This has the 
advantage that, by definition, the baseline configuration will always succeed, 
allowing a direct measurement of the search overhead. As we show in this 
section, it is also possible to analyze the search overhead compared to se- 
rial A* even in cases where serial A* fails. Such comparisons can show how 
effective HDA* is in terms of wasted search effort. 

We start by identifying possible causes of the search overhead in the 
case when the baseline algorithm is serial A*. Let c* be the cost of an 
optimal solution. Serial A* with a consistent heuristic in use expands all 
states with / < c*, and some of the states with f = c*. No states with 
f > c* get expanded. No state is expanded more than once. In contrast, 
parallel variants of A*, including HDA*, can re-expand states with / < c* 
which are not re-expanded by A*, can expand states with / > c*, and can 
expand a different number of states with f = c* than serial A*. Possible 
causes include incomplete local knowledge of open lists at other processors, 
and nondeterministic travel time and arrival order of messages containing 
states. 

Parallel A* usually expands more nodes than serial A*. If we consider 
only nodes with / < c* or / > c*, the best that parallel A* can possibly 
achieve is matching the number of expansions performed by serial A* with 
a consistent heuristic. Thus, while it is possible, in principle, for parallel A* 
to expand fewer nodes than serial A* with a consistent heuristic, this is only 
possible if parallel A* expands fewer nodes with f = c* than serial A*. 

Define R < as the fraction of expanded nodes with f < c*. R= and i?> are 
defined similarly. Let R r be the ratio of node re-expansions, which is the total 
number of re-expansions divided by the total number of expansions. Figure [3] 
summarizes R < , R = and R > for domain- independent planning, according to 
Pmin- R r , the ratio of re-expansions, is shown in Figure SJ 
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Figure 3: Average values for i?<, R = and i?>, on the HPC2 cluster. 

When serial A* fails to solve an instance, a direct measurement of the 
search overhead relative to serial A* cannot be performed. Fortunately, in 
such cases, our "i?* metrics" (_R<, R=, R> and R r ) can allow us to make 
inferences about the search overhead. The key observation is that serial A* 
expands all nodes with f < c*. Therefore, if HDA* exhibits low values for 
R=, R> and R r , on some run, we can conclude that the search overhead is low 
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Figure 4: Ratio of node re-expansions R r , in planning, on the HPC2 cluster. Left: average 
values computed across all instances with the same p m in- Right: maximum values. 



in that run. In other words, in such cases, there is little wasted search effort 
introduced by distributing the search. This appears to be the case for almost 
all instances, when using p m in cores. A noticeable exception is Mprimel5 
(p min = 144), where R K = 6% and R = = 94%E As p, the number of cores in 
use for a given instance, grows increasingly larger than p m in, the R* metrics 
maintain very good values for a while. Then a degradation can eventually 
be observed, for those instances where p min < 144 (Figure [3]). 

To explain the good i?> values of HDA* run on p m i n cores and other core 
configurations as well (cf. Figure [3]), we start by recalling that serial A* with 
a consistent heuristic will expand all nodes with / = v before expanding any 
node with f > v. Such a strict monotonicity cannot be guaranteed for HDA*. 
However, we believe that HDA* can expand states almost monotonically, 
especially when the number of nodes with / < c* is large enough to make the 
instance challenging to a p m j n -core configuration. More specifically, consider 
an instance where the number of nodes with / < c* is large, and the number 
of nodes with f = c* that end up being expanded by serial A* is very small 
in comparison. Serial A* will expand all nodes with f < c*, after which 
it will start expanding nodes with f = c*. Likewise, we hypothesize that 
HDA* could expand a majority of the nodes with / < c*, before starting 
expanding nodes with f > c*. If an optimal solution is found soon after 



8 This is because the optimal solution cost for Mprimel5 is only 6 and there are many 
states with tied / values. 
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Figure 5: Search overhead relative to p m ir, 
averaged over all instances with the same p„ 



cores, on the HPC2 cluster. We plot data 



starting expanding nodes with / > c*, the extra work caused by expanding 
nodes with f > c* would be a small fraction of the total search effort of 
HDA*. This informal explanation is consistent with the cases of low _R> 
values observed among our data. 

All benchmark problems in this paper have unit costs for all transi- 
tions between states. We believe that this contributes to the good (low) 
re-expansion rate R r , when using p min cores and p > p min cores as well, 
unless p gets much larger than p min (see Figure H]). Kobayashi et al. [3l| 
have observed that the re-expansion rate increases in domains with non-unit 
transition costs. 

Next we focus on the search overhead when the baseline algorithm is 
HDA* using p min cores (as defined above, p min depends on the instance). 
Figure [5] summarizes search overhead data for domain-independent planning, 
showing average values over all instances with the same p m in- 

Once again, as a general tendency, as we go further away from the base- 
line Pmin-, the search overhead stays stable for a while, after which it grows 
(Figure [5]). Thus, some of the largest search overheads are seen when the 
difference p vs p min is the largest (2400 vs 12). However, search overhead 
does not appear to be simply caused by using a large number of processors. 
For example, on 13 out of 45 problems, the search overhead on 2400 cores 
is less than 10%. This is despite the fact that p min is in these cases signif- 
icantly lower than 2400, varying from 24 cores (PipesTanklO) to 600 cores 
(PipeNoTank25, PipeNoTank27). 

There seem to be some problem domains which are highly prone to search 
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overhead. Very large search overheads compared to other domains are seen in 
Sokoban problems (Sokoban.p24, Sokoban.p25, Sokoban.p26, Sokoban.p27, 
Sokoban.p28). The R* data available indicates that re-expansions are a ma- 
jor cause of this behavior. In addition, on 1200 cores, the i?< metric also 
degrades for a few Sokoban instances. As the number of processors increased 
from 600 to 1200, the search overhead on Sokoban. p24, Sokoban.p26, and 
Sokoban. p27, more than doubles. The growth in search overhead appears to 
be accelerating as p/p m i n increases. This large, rapidly growing search over- 
head explains the failure to solve the Sokoban problems with 2400 cores. As 
we increase from 1200 to 2400 cores, the amount of search overhead added is 
greater than the additional aggregate RAM capacity, so HDA* fails because 
some processor runs out of RAM. 

Note that there are some instances where search overhead is negative 
relative to p m i n . For example, Mprime30 has a negative overhead for p G 
{60, 144, 300}, relative to p m in = 12. This suggests some wasted search effort 
in the baseline configuration, which then gets corrected for a larger p. Indeed, 
the R < value is 62.2% for Mprime30 solved on p min = 12 cores, whereas most 
other instances, in all considered domains, have much better R < values (i.e., 
close to 100%) on their corresponding p min configuration. Since Mprime 
has much shorter solutions than the other instances, HDA* tends to expand 
many more states with the /-value identical to the optimal length. 

In summary, our analysis shows that HDA* is scalable, in the sense that 
it can use a large number of CPUs to solve difficult instances (which fail 
on a small number of CPU cores) quite efficiently, with a reasonably low 
search overhead. At the same time, for easy instances that can be solved 
with fewer CPUs, the tendency is the following: As the number of processors 
increases from p m i n , the search overhead stays low for a while, after which 
a degradation is observed. One possible approach to alleviate this is using 
a solving strategy able to dynamically change the hardware resources (e.g., 
number of CPUs) [!l[ . Mechanisms for keeping the search overhead low when 
using many more processors than p m in is an interesting topic for future work. 

4-2.3. Node Expansion Rate 

The node expansion rate helps evaluate other parallel overhead, besides 
the search overhead. As the number of processors increases, the communi- 
cation overhead can reduce node expansion rate. Part of the communication 
overhead is alleviated by the fact that searching and travelling overlap to a 
great extent. Said in simple words, some states are being expanded while 
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Figure 6: Node expansion rate relative to the p m in cores baseline, on the HPC2 cluster. 



some other states are travelling to their owner processor. 

Since each core sends generated successors to their home processors in 
HDA*, the total number of messages exchanged among processors increases 
with a larger number of cores. This is exacerbated by the fact that more cores 
can end up generating more states (search overhead). Therefore, another 
cause of slower node expansion rates may be message processing overhead 
- even with asynchronous communications, HDA* must deal with a larger 
number of messages as the number of cores increases. 

Figure |6] plots the expansion rate, measured for planning instances on the 
HPC2 cluster. Each data point in the figure is computed as follows. Let N£ 
be the total number of expansions for an instance i solved on p cores. Let 
Tp, the total search time to solve instance 6, be the (parallel) wall-clock time 
multiplied by p. Then, the node expansion rate is Np/Tp. Let A(p,p min ) be 
the expansion rate, when using p cores, averaged over all instances i with a 
given p min value. Finally, R(p,p min ) = A(p,p min )/A(p min ,p min ) normalizes 
these average values relatively to the baseline configuration with p m in cores. 
Figure [6] plots R(p,p min ). If R(p,p m i n ) is close to 1, this indicates that the 
communication overhead is low. Values smaller than 1 indicate a degradation 
of the expansion rate, typically resulting in a degradation of the parallel 
search efficiency, unless the increase in time spent per node is compensated 
by a corresponding reduction in the total number of node expansions. 

Figure [6] shows that the expansion rate remains almost constant relative to 
Pmin cores, unless the difference p—p m in grows too large. Note that due to the 
logarithmic scale of the X-axis (# of cores), the expansion rate is more stable 
than it might initially appear in Figure El Despite the degradation of node 
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expansion rate at each processor, the aggregate node expansion rate across 
all processors continues to increase as the number of processors increases. 
For example, when p = 2400 and p m in = 12, the expansion rate degrades by 
a factor of about 22, according to Figure |6j On the other hand, the number 
of available CPUs increases by a factor or 2400/12 = 200, so the net gain 
in aggregate node expansion rate is 200/22 = 9.1 times. Similarly, there are 
substantial overall gains in the aggregate expansion rate, for all p and p min 
values considered. 

4-2.4- Termination Detection 

Termination detection is not a performance bottleneck, as it succeeds 
quickly after proving solution optimality. When an optimal (but not proven 
optimal yet) solution is found, its cost c* is broadcast, such that nodes with 
/ > c* will not be expanded from now on. Thus, the only nodes expanded 
after the broadcast (if any) are nodes with / < c*, which must be expanded 
anyway to prove optimality (serial A* expands them as well). Therefore, 
by the time c* has been broadcast and all nodes with f < c* have been 
expanded, state expansion and generation will stop, messages with states 
will not be sent around any longer, and the termination test will succeed. 

4-2.5. Scaling Behavior and the Number of Nodes (Machines) on a HPC 
Cluster 

So far, we have considered the scaling of HDA* as the number of cores was 
increased. However, the scaling behavior of HDA* on a cluster is affected 
by factors other than the number of cores. Another factor is the cost of 
communication between nodes. Communications between cores in the same 
node are done via a shared memory bus, while communications between 
cores residing on different nodes are performed over a local area network 
(e.g., Infmiband or Ethernet). 

To evaluate the impact of the communication delay, we ran the planner on 
the HPC1 cluster, whose specs are shown in Table [TJ We used 64 cores, where 
the cores were distributed evenly on 4-64 nodes (i.e., 1-16 cores per node). 
The results are shown in the upper half of Table HI If the communication 
delay was a significant factor, we would expect that, as the number of nodes 
increased, the runtime would increase. Interestingly, Table H] (upper half) 
shows that runtimes decreased on almost all of the problem instances as the 
number of nodes increased. 
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Table 4: 64-core scaling results with no dummy (upper half) and with dummy processes 
(bottom half). Time and plan length are shown. Notice the increase in the average time 
caused by adding dummy processes. Experiments were performed on the HPC1 cluster. 



The explanation for this counterintuitive result is memory contention 
within each single node. The limited bandwidth of the memory bus is sat- 
urated when many cores in the node simultaneously access memory. We 
further investigated this hypothesis as follows. First, cache contention was 
ruled out as a possible explanation because the processors in the cluster are 
based on the AMD Opteron architecture, where each core has a private LI 
and L2 cache. Then, we performed an experiment where, again, we used 
4-64 nodes, but this time, on each node, on each core that was not used by 
HDA*, we executed a dummy process, which makes no direct contribution 
towards solving the instance at hand. Each dummy process is an invocation 
of HDA* which is forced to run sequentially on the instance, and therefore 
behaves similarly to the "real" HDA* process with respect to memory access 
patterns (and therefore contends for memory with the "real" HDA* process). 

Table H] indicates that the overhead of local memory access congestion 
within a node is much larger than the overhead due to communication be- 
tween nodes. In the normal configurations without dummy processes (Table 
HI upper half), using fewer cores per node results in less local memory con- 
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tention and more inter-node communication. Overall, the performance im- 
proves as the 64 cores are distributed among an increasing number of nodes. 
On the other hand, in the configurations with dummy processes (Table HI 
bottom half), local memory access within a node becomes equally congested 
regardless of the number of nodes, so there is no benefit to distributing 
the cores. Once the local memory access overhead is equalized for all con- 
figurations (4-64 nodes), variations in performance (if any present) among 
configurations could be explained by the overhead of inter-node communica- 
tion. We notice, however, that the differences are small in row "Avg time" 
in the bottom half of Table HJ Even on a per-instance basis, the performance 
degradation as the number of nodes increases from 4 to 64 is within 20%. 
This shows that communications among nodes is not a critical bottleneck for 
HDA* on the HPC1 cluster. 



4-3. Results on the 24-Puzzle on a HPC Cluster 

We evaluated HDA* on the 24-puzzle, using an application-specific solver 
based on IDA* code provided by Rich Korf, which uses a disjoint pattern 
database heuristic 11]. We replaced the IDA* search strategy with A* and 
HDA*. In the original code, each state has two redundant halves, trading 
memory for a faster state processing. While this makes sense for IDA*, it is 
not the best trade-off for HDA*. Thus, we reduce the state size to one half 
and compute the missing half on demand with a loop of 25 iterations per 
state. Other parts of the program, including the disjoint pattern database 
heuristic, were unchanged. As benchmark instances, we used the 50-instance 



set of 24-puzzles reported in [11], Table 2. We excluded the time required to 
read pattern databases from the hard-disk. 

First, we investigated the scalability of HDA* by running it on the fifty 
24-puzzle instances, using 1-1200 cores on 1-100 processing nodes. As with 
domain-independent planning, we computed the efficiencies relative to the 
smallest number of cores p m in £ {12, 24, 60, 144, 300, 600, 1200} which solved 
the problem. These average relative efficiencies for each p m i n are shown in 
Figure CD Compared to domain- independent planning (Figured]), the relative 
efficiency of HDA* degrades much more rapidly on the 24-puzzle. As previ- 
ously mentioned in Section [TJ a key difference between domain-independent 
planning and the 24-puzzle is that individual states are processed much faster 
in the application specific 24-puzzle solver, and therefore parallel overheads 
have a greater weight in the total running time. 
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Table provides summary data obtained on the HPC2 cluster (see Tabled] 
for machine specs). Instances are partitioned according to p m j n . For each 
partition, we report the runtime, the speedup, the load balance, the search 
overhead relative to p m in cores, and the R* metrics (i.e., _R<, R=, i?>, R r ) 
related to the search overhead, introduced in Section 14.2.21 All shown values 
are averages over instances with the same p m m- The load balance is quite 
good, and is no worse than 1.17 (300 and 600 cores for p min = 12). For p min 
processors, the search overhead is small. R r and _R> are within 4%, even for 
p min = 1200. R= varies from 8% (p min G {144,300}) to 24% {p min = 24). 
As with our planning results, search overhead increases, as more processors 
are used and the difference between p and p min increases. Not surprisingly, 
the largest increase is seen when p min = 12 and p = 1200 cores. When an 
instance is easy enough to be solved with 12 cores, 1200 cores will do much 
redundant work, resulting in R > as high as 90%. The search overhead has 
a significant impact on the overall speedup reported in Table [5j As p grows 
larger and larger than p m i n , the speedup increases for a while, after which 
the search overhead dominates and the speedup degrades. 

The trends in search overhead for the 24-puzzle are similar to those ob- 
served for domain- independent planning. On difficult instances (large p m in), 
HDA* can make an effective use of a large number of cores, solving such 
instances with a reasonably low search overhead. For instances that are 
sufficiently easy to be solved with relatively few cores (small p m i n ), the per- 
formance (e.g., speedup) increases for a while, after which a degradation is 
observed. A remarkable difference from the planning data is that the search 
overhead is significantly higher in the 24-puzzle, and it has a sharper degra- 
dation rate as well. 

4-4- Scaling Behavior on Planning on a Commodity Cluster 

While the previous set of large-scale experiments were performed on a 
campus high-performance computing cluster, we also evaluated the scala- 
bility of HDA* on the Commodity cluster (see Table [1] for machine specs). 
Table [6] shows the relative efficiency of 16, 32, and 64 cores compared to a 
baseline of 8 cores (1 processing node). The results are organized according 
to p m i n G {8, 16,32}, the minimum number of tested cores that solved the 
instance. For each p m in, the average relative efficiency and relative speedups 
are shown. Several trends can be seen. First, when p m in = 8, and the num- 
ber of cores used is increased to 16, the relative efficiency (0.55) and speedup 
(1.10) are very poor. On the other hand, after this initial threshold (the 
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Table 5: 24-puzzle scaling summary on the HPC2 cluster. 



jump from 1 processing node to 2 processing nodes) is crossed, the relative 
efficiency and relative speedups are near-linear, with low search overheads, 
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Figure 7: 24-puzzle relative efficiency on the HPC2 cluster. 



similar to our results above on the HPC1 and HPC2 clusters when the number 
of cores used is within a factor of 8 of p m in- Inter-node communication plays a 
role in this behavior. Moving from one processing node to 2 nodes introduces 
inter-node communication, which is relatively slow in a commodity cluster. 
From two to more nodes the relative performance grows more steadily, since 
inter-node communication is present in all multi-node configurations. 

In addition, as shown below in Section HDA* significantly outperforms 
TDS, the previous state of the art algorithm, when the two are compared on 
this commodity cluster. 

5. Tuning HDA* Performance 

The previous section investigated the scaling behavior of HDA* on various 
parallel environments as the amount of available resources varied. In this 
section, we consider how the behavior of HDA* can be tuned by adjusting 
two parameters: the number of cores to utilize per processor node, and the 
number of states to pack in each message between processes. 

5.1. Adjusting the Number of Cores to Utilize Per Processing Node 

Next, we investigate the effect of scaling the number of processing nodes 
(machines) for the 24-puzzle. We ran the solver on a set of 100 nodes, which 
have 1200 cores in total, and varied the number of cores used per node 
between 1-12, so that 100-1200 cores were used. The runtimes are shown in 
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Instance II HDA* 8 cores I HDA* 16 cores I HDA* 32 cores I HDA* 64 cores 



DepotlO 


17.98 (1.0) 


16.97 (1.06) 


8.09 (2.22) 


5.60 (3.21) 


Drivcrlog8 


17.65 (1.0) 


18.22 (0.97) 


7.29 (2.42) 


4.41 (4.00) 


Freecell5 


20.82 (1.0) 


16.27 (1.28) 


8.41 (2.48) 


5.99 (3.48) 


Satellitc6 


20.00 (1.0) 


24.64 (0.81) 


19.86 (1.01) 


7.34 (2.72) 


ZenoTrav9 


27.97 (1.0) 


29.87 (0.94) 


11.65 (2.40) 


7.26 (3.85) 


ZcnoTravll 


79.43 (1.0) 


81.43 (0.98) 


32.36 (2.45) 


19.33 (4.11) 


PipcsNoTkl4 


39.86 (1.0) 


34.26 (1.16) 


12.03 (3.31) 


7.86 (5.07) 


Pegsol27 


26.97 (1.0) 


22.06 (1.22) 


11.08 (2.43) 


7.43 (3.63) 


Airportl7 


48.84 (1.0) 


33.96 (1.44) 


23.78 (2.05) 


15.99 (3.05) 


Gripper8 


56.91 (1.0) 


53.95 (1.05) 


22.51 (2.53) 


14.97 (3.80) 


Mystery6 


48.58 (1.0) 


39.17 (1.24) 


20.13 (2.41) 


11.91 (4.08) 


Truck5 


69.40 (1.0) 


73.72 (0.94) 


68.71 (1.01) 


22.46 (3.09) 


Sokobanl9 


24.73 (1.0) 


21.10 (1.17) 


11.30 (2.19) 


9.72 (2.54) 


Sokoban22 


63.65 (1.0) 


49.18 (1.29) 


25.20 (2.53) 


19.57 (3.25) 


BlockslO-2 


50.83 (1.0) 


47.35 (1.07) 


20.69 (2.46) 


12.64 (4.02) 


Logistics00-7-l 


231.01 (1.0) 


230.97 (1.00) 


91.09 (2.54) 


53.37 (4.33) 


Avg. rel. speedup 
Avg. rel. efficiency 
Avg. search overhead 


1.0 
1.0 



1.10 
0.55 
0.20% 


2.27 
0.57 
3.71% 


3.48 
0.87 
4.05% 


Depotl3 


- (-) 


254.41 (1.0) 


115.85 (2.20) 


75.09 (3.39) 


Freecell7 


- (-) 


265.00 (1.0) 


128.74 (2.06) 


75.27 (3.52) 


Rover 12 


- (-) 


126.11 (1.0) 


57.66 (2.19) 


40.27 (3.13) 


PipesNoTk24 


- (-) 


145.49 (1.0) 


63.31 (2.30) 


38.67 (3.76) 


Pegsol28 


- (-) 


93.55 (1.0) 


45.40 (2.06) 


29.45 (3.18) 


Gripper9 


- (-) 


273.47 (1.0) 


118.38 (2.31) 


76.39 (3.58) 


Truck6 


- (-) 


675.71 (1.0) 


337.01 (2.01) 


168.39 (4.01) 


Truck8 


- (-) 


489.57 (1.0) 


284.01 (1.72) 


116.22 (4.21) 


Logistics00-9-l 


- (-) 


361.88 (1.0) 


154.09 (2.35) 


86.91 (4.16) 


Miconicl2-2 


- (-) 


564.09 (1.0) 


259.01 (2.18) 


159.48 (3.54) 


Miconicl2-4 


- (-) 


600.22 (1.0) 


266.08 (2.26) 


169.49 (3.54) 


Mprime30 


- (-) 


247.19 (1.0) 


100.27 (2.47) 


68.28 (3.62) 


Avg. rel. speedup 
Avg. rel. efficiency 
Avg. search overhead 




1.0 
1.0 



2.18 
1.09 
0.61% 


3.63 
0.91 
0.91% 


Rover6 


- (-) 


- (-) 


268.59 ( 1.0) 


162.82 (1.65) 


Solved only on 64 cores: Freecell6 (302.50 seconds), Satellite7 (633.26) 
Sokoban26 (205.80), Blocksll-1 (194.57) and Logistics00-8-l (491.02). 



Table 6: Planning results on the Commodity cluster: runtime, relative speedup (between 
brackets), average efficiencies, and average search overhead, by p m in- 



Table [7J IDA* solves all 50 instances whereas with our HDA* using 12 
cores only 10 instances can be solved with 54GB of memory. 

As shown in Tabled using 1200 cores (4.5GB/core), 40 out of the 50 
problems were solved. With 600 cores (9GB/core), 41 problems were solved. 
With 300 cores (18GB/core), 44 problems were solved, and with 100 cores 
(54GB/core), 45 problems were solved. That is, reducing the number of pro- 
cesses down to one process per node increases the number of solved instances. 
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1U0 cores 


300 cores 


600 cores 


1200 cores 


Average time 


209.43 


68.61 


35.00 


35.93 


Nr solved 


45 


44 


41 


40 


d1 


6.35 


4.72 


15.94 


35.25 


d2 


356.53 


116.86 


75.21 


45.99 


d3 


31.61 


12.37 


15.04 


30.12 


p4 


33.68 


14.12 


14.04 


28.38 


p5 


11.78 


7.43 


13.23 


30.07 


nfi 


75.98 


28.15 


24.06 


36.46 


p7 


211.34 


75.58 


48.84 


42.21 


p8 


190.51 


74.45 


51.32 


34.68 


d9 


1038.60 


365.60 






pl2 


618.90 


200.37 


119.41 




pl3 


12.62 


6.46 


14.22 


32.34 


pl5 


446.57 


167.86 


99.71 


57.72 


pl6 


10.77 


5.71 


11.20 


26.67 


pl8 


1032.34 


367.38 






pl9 


162.77 


58.76 


40.16 


37.02 


p20 


188.30 


69.20 


47.09 


33.40 


p21 


925.68 


333.59 






p22 


4.12 


4.60 


12.75 


24.58 


p23 


410.70 


156.23 


96.00 


61.63 


p24 


277.98 


92.09 


58.31 


52.59 


p25 


1.37 


3.44 


10.19 


19.30 


p26 


34.32 


13.88 


18.86 


47.11 


p27 


68.71 


24.33 


23.68 


35.20 


p28 


2.55 


5.33 


13.33 


26.13 


p29 


8.23 


5.61 


11.80 


24.51 


p30 


4.54 


4.38 


11.91 


25.22 


p31 


52.41 


20.46 


19.12 


34.28 


p32 


2.36 


4.78 


10.85 


29.31 


p33 


1104.11 








p34 


412.80 


154.47 


91.31 


57.49 


p35 


183.70 


65.78 


50.36 


33.55 


p36 


4.70 


4.70 


12.24 


31.58 


p37 


10.17 


6.36 


14.28 


30.20 


p38 


1.57 


5.36 


13.48 


28.11 


p39 


280.13 


94.78 


57.25 


44.66 


p40 


1.01 


3.44 


10.21 


22.02 


p41 


41.20 


17.07 


16.88 


37.99 


p42 


282.17 


98.58 


56.30 


42.23 


p43 


52.75 


20.57 


20.61 


37.44 


p44 


1.48 


4.06 


11.88 


24.89 


p45 


132.27 


43.76 


32.06 


34.47 


p46 


163.49 


59.35 


42.30 


37.01 


p47 


32.70 


13.52 


16.29 


36.97 


p48 


412.32 


152.31 


90.36 


56.71 


p49 


86.03 


30.95 


22.88 


31.61 



Table 7: Runtimes in seconds for the 24-puzzle on 100 nodes of the HPC2 cluster. Average 
runtime only includes instances solved by all configurations. Instances which were not 
solved by any configuration are excluded. 



A closer look at the trade-offs involved explains this behavior. First 
there is the reduction in execution time when performing a given amount 
of work with more cores. Indeed, Table [7| indicates that, if an instance is 
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solved by a larger number of cores, the time tends to decrease as more cores 
were used. The notable exception is the configuration with 1200 cores where 
the large search overhead actually degrades the time performance. Other 
exceptions are instances that are easy for 100 cores, such as p22, p25, p28, 
p38, p40 and p44, which were solved in a few seconds and left little room for 
further improvement in runtime, resulting in consistently increasing runtimes 
as the number of cores increased. On the other hand, using more cores per 
processing node can lead to solving fewer instances. Suppose that k unique 
states need to be stored in open/closed in order to solve a problem. A 
sequential search with enough memory capacity to hold k states can solve this 
problem. If HDA* allocated work perfectly equally among the processors, 
and there was no search overhead, then the partitioning of memory among n 
cores would have no negative impact. In practice, load balancing is imperfect, 
and search overhead is non-zero, so increasing the number of cores (for a fixed 
amount of aggregate RAM) increases the chance of failure on a hard problem. 

When analyzing the relative impact of search overhead vs. RAM par- 
titioning, we found that the former plays a significantly greater role in the 
reduction of the number of solved instances. Table shows that the search 
overhead eventually increases significantly as the number of cores increases, 
meaning that more nodes must be stored in the open/closed lists. The rate 
of increase accelerates as the number of cores increases further beyond the 
minimal number of cores needed to solve the problem. On the other hand, 
our use of static RAM partitioning (as opposed to a more flexible, dynamic 
partitioning) is not a significant bottleneck. We measured the size of the open 
and closed lists for all processors, and observed that when the first processor 
runs out of memory, most other processors have almost exhausted their mem- 
ory as well. This means that there is little opportunity for improvements to 
be made by using a more flexible, dynamic memory partitioning method. 

In Section 14.2.51 we observed slowdowns in domain- independent planning 
as the ratio of utilized cores per node increased, and ascribed the slowdown 
to local memory bus contention. In the 24-puzzle, the node expansion rate 
decreases by 35% as the number of cores per processing node increased from 
1 to 12. We attribute this increase to both local memory bus contention, and 
factors discussed in Section 14.2.31 

5.2. The Effect of the Number of States Packed into Each Message 

In HDA*, each state generation necessitates sending the state from pro- 
cessor where a state is generated to the processor which "owns" the state. 
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Sending a message from processor P to Q each time a state owned by Q is 
generated at P may result in excessive communication overhead, as well as 
overhead for creating/manipulating MPI message structures. 



In order to amortize these overheads, Romein et al. [12| , in their work on 
TDS, proposed packing multiple states with the same destination. 

On the other hand, packing too many states into a message from processor 
P to Q might result in degraded performance for two reasons. First, the 
destination Q might be starved for work and be idle. Second, too much 
packing can result in search overhead, as follows. Consider a state S on an 
optimal path, which is "delayed" from being sent to its owner because the 
processor which generated S is waiting to pack more states into the message 
containing S. In the meantime, the owner of S is expanding states which are 
worse than S and possibly sending those successors to fill up the open list of 
their owner processors, and so on. 

We compared packing 10, 100, and 1000 states per message on the Commodity 
cluster using 64 cores, using the same benchmark instances in Table [6], except 
that the Airportl7 instance was excluded because it could not be solved us- 
ing 10 states per message. Using 10 states per message, the average runtime 
was 31.6% slower than when using 100 states per message, and 1.0% fewer 
nodes were expanded. Using 1000 states per message, the average runtime 
was 37.4% slower than when using 100 states per message, and 11.6% more 
nodes were expanded. Thus, while packing fewer states per message reduces 
search overhead, packing more states per message reduces communication 
overhead, and in this case, packing 100 states per message performs well by 
striking a balance between these two factors. 

6. Hash-Based Work Distribution vs. Random Work Distribution 
on a HPC Cluster 



Kumar et al. [23| and Karp and Zhang [24| proposed a simple, random 
work distribution strategy for best-first search where generated nodes are 
sent to a random processor. While this is similar to HDA* in that a random- 
ization mechanism is used to distribute work, the difference is that duplicate 
states are not necessarily sent to the same processor, since a state has no 
"owner" . Although duplicates are pruned locally at each processor, there is 
no global duplicate detection, so in the worst state can be in the local 

open/closed lists of every single processor. 
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We evaluated this "Random" work distribution strategy on our Fast- 
Downward based do main- independent planner on the HPC1 cluster. First, 
we attempted to compare HDA* and Random using 16 cores, 2GB per core. 
However, the Random strategy failed to solve any of the test problems in this 
configuration - all of the runs failed due to memory exhaustion. This indi- 
cated that, due to the lack of (global) duplicate detection, the random work 
distribution strategy requires much more RAM to solve our benchmarks. 
Hence we performed a comparison using 16 cores, 32GB per core (as with 
the experiments in Section 14.2.51 this configuration left 15 of the 16 cores on 
each processing node idle, in order to maximize memory available per core). 
The results are shown in Table [SJ The execution time and number of nodes 
expanded by HDA* is more than an order of magnitude less than those of 
the random work distribution strategy. In fact, the random work distribu- 
tion strategy is slower than the sequential A* algorithm because of large 
search overhead. The load balance is similar for both HDA* and random 
work distribution, indicating that hash-based work distribution is success- 
fully distributing the work as evenly as a pure, randomized strategy. These 
results clearly demonstrate the benefit of using hash-based work distribution 
in order to perform global duplicate detection as well as load balancing. 





Execution time 


nodes expanded xl0 B 


load balance 




Random 


HDA* 


Random 


HDA* 


Random 


HDA* 


Driverlogl3 


n/a 


1056.24 


n/a 


618 


n/a 


1.05 


Freecell6 


n/a 


990.19 


n/a 


341 


n/a 


1.04 


Freecell7 


3170.81 


234.37 


1,202 


97 


1.01 


1.02 


Satellite? 


n/a 


1493.33 


n/a 


219 


n/a 


1.005 


ZenoTravll 


1022.57 


54.93 


221 


14 


1.0002 


1.13 


ZcnoTravl2 


n/a 


4393.80 


n/a 


1,152 


n/a 


1.03 


PipcsNoTk24 


2094.81 


130.13 


1,065 


72 


1.0001 


1.005 


Pegsol28 


1040.28 


74.04 


1,364 


112 


1.01 


1.005 


Pegsol29 


n/a 


364.39 


n/a 


481 


n/a 


1.01 


Pegsol30 


n/a 


978.11 


n/a 


1,379 


n/a 


1.01 


Sokoban24 


2748.71 


171.28 


3,254 


244 


1.01 


1.01 


Sokoban26 


n/a 


488.15 


n/a 


659 


n/a 


1.01 


Sokoban27 


n/a 


942.57 


n/a 


1,162 


n/a 


1.03 



Table 8: HDA* vs random work distribution ("Random"). Execution time, nodes ex- 
panded, and load balance on 16 cores (on 16 processing nodes), using 32GB per core. 
Experiments were performed on the HPC1 cluster. 
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7. Comparison of HDA* vs. TDS on a Commodity Cluster 



Transposition-Driven Scheduling (TDS) is a parallelization of IDA* with 
a distributed transposition table, where hash-based work distribution is used 
to map states to processors for both scheduling and transposition table checks 



13J. TDS has been applied successfully to sliding tiles puzzles and Rubik's 



Cube [13J , and has also been adapted for adversarial search |32l . |33[ . While 
TDS has not been applied to planning, recent work has shown that IDA* with 
a transposition table (IDA*+TT) is a successful search strategy for optimal, 
domain- independent planning [42j - on problems which can be solved by A*, 
the runtime of IDA*+TT is usually within a factor of 4 of A*, and IDA*+TT 
can eventually solve problems where A* exhausts memory. Therefore, we 
compared HDA* and TDS [13] for planning. 

The experiments were performed on the Commodity cluster (see Table [1] 
for machine specs) using 64 cores, with a 20 minute time limit per instance. 
Both HDA* and TDS were run on 35 planning instances (the same instances 
as for the experiments in Section H3J Table E]). 

Our implementation of TDS uses a transposition table implementation 



based on |42j], where the table entry replacement policy is a batch replace- 
ment policy which sorts entries according to access frequency and periodically 
frees 30% of the entries (preferring to keep most frequently accessed entries) Jj 
We incorporated techniques to overcome higher latency in a lower-bandwidth 
network described in [13| . such as their modification to the termination de- 
tection algorithm. 

Out of 35 instances, HDA* failed to solve one case (Blocksl2-l) and 
TDS failed to solve 8 cases (Logistics00-8-l, Sokoban22, Sokoban26, Truck6, 
Gripper9, Freecell6, Rover6, Satellite?) . Figure [S] directly compares HDA* 
and TDS on 26 instances solved by both algorithms, for which it is possi- 
ble to compute the ratios of performance measures such as time, expanded 
states and evaluated stateso HDA* is consistently faster, with a maximum 
speedup of about 65. 

There are several differences between HDA* and TDS that could cause 



9 Although replacement based on subtree size performed best in sequential search 421 ] . 
we did not implement this policy because subtree size computation would require extensive 
message passing in parallel search. 

10 Evaluated states are states which were not found in the Open/Closed set (in the case 
of HDA*), or not found in the transposition table (in the case of TDS). 
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this significant performance difference. Below, we first enumerate these dif- 
ferences, and then consider how they apply to our results. 

First, TDS is an iterative deepening strategy, so TDS will incur state 
reexpansion overhead, since many states will be reexpanded as the iteration 
/-bound increases. 

Second, HDA* opens states in a (processor-local) best-first order, while 
TDS opens states in a last-in-first-out order in its local stack, subject to an 
iteration bound for the /-value. This difference in state expansion policy 
results in different search overheads for the two algorithms. 

Third, each processor in TDS needs memory not only for a transposition 
table and the merge-and-shrink abstraction (for planning), but also for a 
work stack. In the tested configuration, each processor is allocated 2GB of 
memory. 1.2GB is allocated to the transposition table and the abstraction. 
The remaining memory is reserved for the work stack0 Therefore, there 
may be cases where there is sufficient aggregate memory in HDA* to solve 
an instance, but TDS (on the same system) exhausts the space allocated 
for its distributed transposition table. In such a situation, TDS applies a 
replacement policy to replace some of its transposition table entries with 
newly expanded states. While this seeks to maintain the most useful working 
set of states, it is possible that valuable entries (states) in the table are 
replaced, resulting in wasted work later when these states are revisited. 

Fourth, TDS incurs synchronization overhead while all processors wait 
for the current iteration to end. 

Standard IDA* often generates states faster than A*, because successor 
generation can be implemented as an in-place incremental update of a data 
structure representing the current state. Thus, sequential IDA* can outper- 
form A* even if IDA* searches less efficiently than A*. When a transposition 
table is added to IDA*, state generation incurs a significant overhead be- 
cause the hash value of each generated state must be computed, and if there 
is a transposition table miss, some hashed representation of the state must 
be stored in the table. Furthermore, in TDS, every distributed transposition 



11 While Romein et al.'s stack does not exceed 1MB in their applications [13[, our stack 
often used hundreds of megabytes of memory. Due to the much larger branching factors of 
our planning domains, compared to their domains (15-puzzle, Rubik's cube), TDS sends 
away more states when generating successors, causing a larger work queue. A possible 
future improvement is to combine Dutt and Mahapatra's technique in SEQ_A* |6| with 
TDS and restrict initiating parallelism. 
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0.1 1 

Instances ordered by time ratio 



Figure 8: Ratios TDS/HDA* on 64 cores, on the Commodity cluster (see Table Q] for 
machine specs). 



table access requires a hashed representation of the state to be generated and 
sent to the owner process. Thus, parallel IDA* no longer has an inherently 
faster state generation rate than parallel A*. 

We now consider the contribution of each of the above differences. In 
Figure El the ratio of expanded states closely follows the runtime ratio@ 
On the other hand, the ratio of evaluated states is close to 1 in most cases. 
This suggests that both algorithms explore a similar number of unique states. 
Furthermore, the similarity in the number of evaluated states shows that the 
transposition table used by TDS is large enough to fit most (unique) states 
generated during search, so the size of the transposition table did not limit 
the performance of TDS. 

The difference in performance is mostly due to the re-expansion overhead 
incurred by TDS during iterative deepening. This conclusion is supported by 
the close correspondence between the runtime ratio and the state expansion 
ratio. Also, it is consistent with the behavior of sequential A* and IDA*. For 
example, on the Airportl7 instance, both A* and IDA* evaluate 10,397,245 
states. While A* expands only 7,126,967 states, IDA* expands 239,705,187 



12 The ratio of generated states, not shown to reduce clutter, is almost identical to the 
ratio of expanded states. 
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TDS estimated iterations 
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Figure 9: Estimated number of TDS iterations. 

states, indicating that most of the states are reexpandedf^l 

Clearly, the number of re-expansions depends on the number of iterations 
performed by TDS. The exact number of iterations is unavailable when TDS 
runs out of time. An upper bound can be computed as u = c* — h(so) + 
1, where c* is the optimal cost (available due to HDA*), and h(so) is the 
heuristic evaluation of the initial state. We observed that this was exactly the 
number of iterations in the cases where TDS succeeded within the time limit. 
More generally, in all cases that we observed, the cost threshold increase from 
one iteration to the next was 1. Thus, we hypothesize that the upper bound 
u is in fact quite an accurate estimate of the actual number of iterations. On 
the 35 instances considered in this experiment, the (estimated) number of 
iterations varies from 4 to 159, with an average value of 32.48. Data for all 
instances are available in Figure O 

The synchronization overhead in TDS between iterations, as all processors 
wait for the iteration to end, does not appear to be significant in most cases, 
which is consistent with Romein et al.'s results (otherwise, the runtime ratio 
would be consistently higher than the state expansion ratio). 



13 Extending TDS to backpropagate search results might reduce the very high reexpan- 
sion overhead in planning, which did not occur in Romein et al.'s applications. 



42 



Speedup over sequential search 
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Figure 10: Speedup over sequential search. A missing data point indicates that the sequen- 
tial algorithm or both the sequential and the parallel algorithm failed to find a solution. 
Experiments were performed on the Commodity cluster. 



Next we consider the speedups obtained by HDA* and TDS over their 
sequential counterparts, A* and IDA* with transposition table (IDA*+TT). 
The speedup of TDS vs IDA*+TT depends greatly on the amount of RAM 
memory available for the transposition table. The larger the (aggregate) 
transposition table, the faster TDS and IDA* are. While it is possible to 



achieve drastically large (sometimes super linear) speedup over IDA* 12|, [13 
the speedup diminishes as IDA* has access to larger and larger transposition 
tables. Figure [10] shows that, depending on the amount of RAM available 
to IDA*+TT, the speedup of TDS vs IDA*+TT can be either smaller or 
larger than speedup of HDA* vs A*. In this experiment, serial IDA*+TT 
was tested using 1.5GB of RAM and 26GB of RAM. TDS uses 1.2GB x 64 
cores, or 76.8GB of aggregate RAM. The time limit of serial IDA*+TT was 
set to 1280 minutes (i.e., about 21.3 hours) per instance. 

7.1. A Simple, Hybrid Strategy Combining HDA* and TDS 

The results above indicate that, for planning, HDA* significantly out- 
performs TDS on instances that can be solved within the memory available 
to HDA*. However, while HDA* will terminate and fail when memory is 
exhausted at any processor, TDS will not terminate if the local transposi- 
tion table at any processor becomes full. Mechanisms which allow A*-based 
search to continue after memory is exhausted have been proposed. For ex- 
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ample, PRA* [5j, has a state retraction mechanism which frees memory by 
retracting some states at the search frontier (however, as explained in Section 
[2]), the PRA* retraction policy results in synchronization overhead). Deter- 
mining an effective state retraction policy is a non-trivial extension to HDA* 
and an interesting direction for future work. 

Consider instances which are solvable by TDS (within a reasonable amount 
of time) but not by HDA*. Such problems are not very common. Given that 
HDA* and TDS explore a very similar set of unique states (as shown above), 
if a problem cannot be solved by HDA*, it indicates that the instance is quite 
difficult, and it is likely that TDS cannot solve the problem either within a 
reasonable time limit. 

The instances that HDA* failed to solve, as well as the amount of time 
HDA* executed before exhausting RAM) are listed below for 16, 32, and 64 
cores (all with 2GB per core): 

• Failed with 16 cores: Freecell6 (269 sec), Rover6 (453 sec), Satellite? 
(573 sec), Sokoban26 (176 sec), Blocksll-1 (165 sec), Blocksl2-l (148 
sec), Logistics00-8-l (416 sec); 

• Failed with 32 cores: Freecell6 (303 sec), Satellite? (722 sec), Sokoban26 
(190 sec), Blocksll-1 (190 sec), Blocksl2-l (156 sec), Logistics00-8-l 
(399 sec); 

• Failed with 64 cores: Blocksl2-l (189 sec). 

Next, we attempted to solve these instances using TDS with 16, 32, and 
64 cores (all with 2GB/core), with an extended time limit of 1.5 hours for 
32 and 64 cores and 3 hours for 16 cores per instance. The runtimes on the 
instances which were solved by TDS, but not by HDA* are as follows (TDS 
failed to find a solution within the time limit on the other instances): 

• 16 cores: Blocksll-1 (2928 sec), Blocksl2-l (4139 sec); 

• 32 cores: Freecell6 (4350 sec), Blocksll-1 (974 sec), Blocksl2-l (1874 

sec); 

• 64 cores: Blocksl2-l (838 sec). 

Thus, even with a 1.5 or 3 hours per instance, many instances that cannot 
be solved by HDA* cannot be solved by TDS. In addition, on the instances 
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where HDA* fails and TDS succeeds, HDA* fails relatively quickly compared 
to the time required by TDS to solve the problem. For example, with 32 cores 
HDA* exhausts memory in 156 seconds in solving Blocksl2-l, and although 
TDS can solve this problem with 32 cores, it required 1874 seconds. 

This suggests a very simple, hybrid approach which combines the speed of 
HDA* and the ability of TDS to eventually solve difficult problems without 
running out of memory. First, a slightly modified version of HDA* is applied, 
which keeps track of the lowest /-cost in the OPEN list is recorded at every 
processor. If HDA* finds a solution, then it is returned. However, if HDA* 
exhausts memory at any processor, then it terminates with failure, and also 
returns the minimum frontier value among all of the processors, f m in- Then, 
we apply TDS, except that instead of setting the initial iteration bound to 
the heuristic value at the initial state, we start with / m j n as the lower bound. 

According to the data above, if the HDA* phase succeeds, this hybrid 
procedure will succeed significantly faster than TDS would, and if it fails, 
it will fail relatively quickly. Upon failure, the hybrid starts TDS with a 
iteration bound f min (skipping some wasteful iterations). 

On instances solvable by HDA*, the hybrid runtime will be the same 
as for HDA*. On instances where HDA* fails but TDS would eventually 
succeed, the hybrid runtime would be similar to the runtime of TDS. The 
time spent in the failed run of HDA* should only be a small fraction of the 
time required to eventually solve the instance; furthermore, using the f m in 
initial bound for TDS is expected to offset some of the time spent in the 
failed HDA* run by eliminating some of wasted expansions in TDS. 

Table [9] shows the results of applying this hybrid to the instances listed 
above which caused HDA* to fail. In most cases, the runtimes for the hybrid 
are comparable to the runtimes for TDS. This shows that the hybrid success- 
fully combines the strengths of both HDA* and TDS while avoiding their 
disadvantages, and does so while avoiding excessive hybridization overhead. 

8. Related Work 

We have discussed the related work that parallelizes search by partitioning 
the search space in Section [2J In this section, we review related work on 
parallel search in model checking. In the second half, we also review other 
approaches to parallelizing search algorithms. 

While this paper focuses on standard AI search domains including do- 
main independent planning and the sliding tile puzzle, distributed search, 
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Instance 


] 

HDA* 


Hybrid 
TDS 


Total 


TDS 


16 cores 

Blocksll-1 


165 


2517 


2682 


2928 


Blocksl2-l 


148 


4192 


4340 


4139 


32 cores 

Freecell6 


303 


3781 


4084 


4350 


Blocksll-1 


190 


865 


1055 


974 


Blocksl2-l 


156 


1876 


2032 


1874 


64 cores 

Blocksl2-l 189 


566 


755 1 838 



Table 9: Time in seconds for TDS and Hybrid HDA* + TDS (these are instances which 
could not be solved by HDA* by itself). Experiments were performed on the Commodity 
cluster. 



including hash-based work distribution, has also been studied extensively by 
the parallel model checking community. Parallel Munp 43|, |44j addresses ver- 
ification tasks that involve exhaustively enumerating all reachable states in a 
state space. Similarly to HDA* and other work described in Section [2] jEl, 0], 
Parallel Munp implements a hash-based work distribution schema where each 
state is assigned to a unique owner processor. Kumar and Mercer [45| present 
a load balancing technique as an alternative to the hash-based work distribu- 
tion implemented in Munp. The Eddy Murphi model checker j46[ specializes 
processors' tasks, defining two threads for each processing node. The worker 
thread performs state processing (e.g., state expansion), whereas the other 
thread handles communication (e.g., sending and receiving states). 

Lerda and Sisto parallelized the SPIN model checker to increase the avail- 
ability of memory resources [13]. Similar to hash-based distribution, states 
are assigned to an owner processing node, and get expanded at their owner 
node. However, instead of using a hash function to determine the owner pro- 
cessor, only one state variable is taken into account. This is done to increase 
the likelihood that the processor where a state is generated is identical to 



the owner processor. Holzmann and Bosnacki [48( introduce an extension to 
SPIN to multicore, shared memory machines. Garavel et al. 49J use hash- 
based work distribution to convert an implicitly defined mo del- checking state 
space into an explicit file representation. Symbolic parallel model checking 
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has been addressed in 50 



Thus, hash-based work distribution and related techniques for distributed 
search have been widely studied for parallel model checking. There are sev- 
eral important differences between previous work in model checking and this 
paper. First, this paper focuses on parallel A*. In model checking, there is 
usually no heuristic evaluation function, so depth-first search and breadth- 
first search is used instead of best-first strategies such as A*. 



Second, reachability analysis in model checking (e.g., |43|, |44|, |47J, |49|), 
which involves visiting all reachable states, does not necessarily require op- 
timality. Search overhead is not an issue because both serial and parallel 
solvers will expand all reachable states exactly once. In contrast, we specifi- 
cally address the problem of finding an optimal path, a significant constraint 
which introduces the issue of search efficiency because distributed A* (includ- 
ing HDA*) searches many nodes with the /-cost greater than or equal to the 
optimal cost, as detailed in Section |4.2.2( furthermore, node re-expansions in 
parallel A* can introduce search overhead. This is the first paper to analyze 
search overhead. 

Finally, while previous work in model checking has used up to 256 pro- 
cessors [5l| , our work presents the largest scale experiments with hash-based 
work distribution to date, showing that hash-based work distribution can 
scale efficiently relative to p min even for up to 2400 processors. 

Kobayashi et al. 3l| have applied HDA* to multiple sequence alignment 
(MSA). The non-unit transition costs lead to a higher rate of re-expansions, 
increasing the search overhead. To address this, the authors introduced a 
work distribution strategy that exploits the structure that MSA and po- 
tentially other problems exhibit, replacing the Zobrist-based, global hashing 
scheme. In contrast, in this work, our focus is to analyze HDA* at length, 
hoping to establish HDA* as a simple and scalable baseline algorithm for 
parallel optimal search. 

One alternative to partitioning the search space among processors is to 
parallelize the computation done during the processing of a single search 
node (c.f., J52|, |53j). The Operator Distribution Method for parallel Planning 
(ODMP) [54| parallelizes the computation at each node. In ODMP, there 
is a single, controlling thread, and several planning threads. The controlling 
thread is responsible for initializing and maintaining the current search state. 
At each step of the controlling thread main loop, it generates the applica- 
ble operators, inserts them in an operator pool, and activates the planning 
threads. Each planning thread independently takes an operator from this 
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shared operator pool, computes the grounded actions, generates the result- 
ing states, evaluates the states with the heuristic function, and stores the 
new state and its heuristic value in a global agenda data structure. After the 
operator pool is empty, the controlling thread extracts the best new state 
from the global agenda, assigning it to the new, current state. 

Another approach is to run a set of different search algorithms in parallel. 
Each process executes mostly independently, with periodic communication 
of information between processors. This approach, which is a parallel ver- 
sion of an algorithm portfolio [55|, seeks to exploit the long-tailed runtime 
distribution behavior encountered in search algorithms [56[ by using differ- 
ent versions of search algorithms to search different (potentially overlapping) 
portions of the search space. An example of this is the ManySAT solver 57| . 
which executes a different version of a DPLL-based backtracking SAT solver 
on each processor and periodically shares lemmas among the processes. 



A third approach is parallel-window search for IDA* 58j . where each 
processor searches from the same root node, but is assigned a different bound 
- that is, each processor is assigned a different, independent iteration of IDA*. 
When a processor finishes an iteration, it is assigned the next highest bound 
which has not yet been assigned to a processor. 



Vidal et al. 59] propose a multicore version of the KBFS algorithm [60 
In this approach, each thread expands one node from the Open list at a 
time. As each expansion step requires operations on the Open and Closed 
list, synchronization is needed to ensure that only one thread at a time can 
perform such operations. Experiments are reported for satisficing planning, 
as opposed to our work, which is focused on optimal planning. 

The best parallelization strategy for a search algorithm depends on prop- 
erties of the search space, as well as the parallel architecture on which the 
search algorithm is executed. The EUREKA system 6l| used machine learn- 
ing to automatically configure parallel IDA* for various problems (including 
nonlinear planning) and machine architectures. 



Niewiadomski et al. |40| propose PFA*-DDD, a parallel version of Fron- 



tier A* with Delayed Duplicate Detection. While achieving very good speed- 
up efficiency (often superlinear), PFA*-DDD is limited to returning only 
the cost of a path from start to target, not an actual path. While divide- 
and-conquer (DC) can be used to reconstruct a path (as in sequential frontier 
search), parallel DC poses non-trivial design issues that need to be addressed 
in future work. See the original paper for a discussion. 
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9. Discussion and Conclusion 



This paper investigated the use of hash-based work distribution to paral- 
lelize A* for hard graph search problems such as domain-independent plan- 
ning. We implemented Hash-Distributed A*, a simple, scalable paralleliza- 
tion of A*. The key idea, first used in Parallel Retracting A* js[, is to 
distribute work according to the hash value for generated states. HDA* 
is a simple implementation of this idea, which, to our knowledge, has not 
been previously evaluated in depth. Unlike PRA*, which was a synchronous 
algorithm due to its retraction mechanism, HDA* operates completely asyn- 
chronously. Also, unlike previous work such as PRA* and GOHA [6(, which 
implemented hash-based work distribution on variants of A*, HDA* is a 
straightforward implementation of hash-based work distribution for stan- 
dard A*. We evaluated HDA* as a replacement for the sequential A* search 
engine for a state-of-the-art, optimal sequential planner, Fast Downward (if. 
We also evaluated HDA* on the 24-puzzle domain by implementing a parallel 
solver with a disjoint pattern database heuristic [llj . 




Our experimental evaluation shows that HDA* scales well in several par- 
allel hardware configurations, including a single shared memory machine, 
high-performance computing clusters using Infiniband interconnects, and a 
local cluster with 1Gb (x2) Ethernet network. 

While HDA* is naturally suited for distributed memory parallel search 
on a distributed memory cluster of machines, we have shown that HDA* also 
achieves reasonable speedup on a single, shared memory machine with up to 
8 cores, yielding speedups of 3.8-6.6 on 8 cores. Burns et al. have recently 
proposed PBNF, a shared memory, parallel best-first search algorithm, and 
showed that PBNF outperforms HDA* on planning in shared memory envi- 
ronments • On the other hand, while HDA* is not necessarily the fastest 
algorithm on a shared memory environment, HDA* is simpler than PBNF 
(which is based on structured duplicate detection techniques). Furthermore, 
HDA* is suited for larger, distributed memory clusters, while PBNF is de- 
signed for shared memory machines. 

Evaluation of HDA* on a large high-performance cluster using up to 2400 
cores showed that HDA* scales well relative to p m in, the minimum number of 
processors (+memory) that can solve the problem. In our benchmarks p m in 
ranged up to 600 cores. When up to 8 times p min processors are used, HDA* 
scales up relatively efficiently However, as additional processors are used, the 
search overhead (wasteful node expansions) increases, resulting in degraded 
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efficiency as the number of processors used greatly exceeds p m i n . Comparison 
with a randomized work distribution strategy which performs load balancing 
but no duplicate detection 23|, |24J showed the simple hash-based duplicate 
detection mechanism is essential to the performance of HDA*. The scaling 
behavior of HDA* on a small commodity cluster with 1Gb (x2) Ethernet 
shows similar trends as on the HPC cluster, except that going from 8 cores 
(1 processor node) to 16 cores (2 processor nodes) is inefficient. Increasing 
the number of processing nodes beyond 2 scaled smoothly. Qualitatively 
similar scaling results were obtained for the 24-puzzle. However, because 
of the faster node processing time for the 24-puzzle compared to domain- 
independent planning, scaling degraded faster for the 24-puzzle. 

Much of the previous literature on parallel search has focused on maxi- 
mizing the usage of all available CPU cores. Our results indicate that this 
approach can be suboptimal. In fact, the best performance can be obtained 
by keeping many of the available processors idle. This is a result of the 
prevalent architecture in current clusters, which are composed of commod- 
ity, multicore shared-memory processors connected via a high-speed network. 
There are two distinct factors which can lead to suboptimal performance 
when CPU core usage is maximized. First, as we showed in Section I4.2.5[ 
contention for local memory on each machine becomes a bottleneck, so using 
all of the available cores on a machine results in performance degradation. 
Automatically determining the optimal number of cores to use per processing 
node in order to obtain the best tradeoff between computation and memory 
bandwidth is an area for future work. Second, in parallel best-first search, 
there is a tradeoff between the number of cores used per machine, and the 
fragmentation of the local open list. Fragmentation of the open list results 
in search overhead, because the local open lists are not aware of the cur- 
rent global best /-values. Developing mechanisms which seek to reduce this 
search overhead is an area for future work. 

Our comparison of HDA* with TDS showed that when HDA* had suffi- 
cient memory to solve a problem, it significantly outperformed TDS on our 
benchmark planning instances. We showed that this performance gap was 
due to the reexpansion of states by TDS. We observed that on problem in- 
stances where HDA* fails due to memory exhaustion, TDS either fails due 
to exceeding the time limit, or TDS eventually solves the instance but re- 
quires a long time. Thus, we investigated a simple, hybrid strategy which 
first executes HDA* until either the problem is solved or until memory is 
exhausted. If HDA* fails, execute TDS starting with a more informed initial 
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threshold, provided by the failed HDA* run. We showed that this hybrid 
strategy effectively combines the advantages of both HDA* and TDS. 

One particularly attractive feature of HDA* is its simplicity. Work dis- 
tribution and duplicate detection is done by a simple hash function, and 
there is no complex load balancing mechanism. All communications are 
asynchronous. As far as we know, HDA* is the simplest parallelization for 
A* which achieves both load balancing and duplicate detection. Simplicity 
is very important for parallel algorithms, particularly for an algorithm that 
runs on multiple machines, as debugging a multi-machine, multicore algo- 
rithm is extremely challenging. In some preliminary efforts to implement a 
distributed memory, work-stealing algorithm, we have found that it is signif- 
icantly more difficult to implement it correctly and efficiently compared to 
HDA*. Furthermore, using the standard MPI message passing library, the 
same HDA* code can be recompiled and executed on a wide range of shared 
memory and distributed clusters, making it simple to port HDA*. This is in 
contrast to more complex algorithms which depend on specific architectural 
features such as shared memory. 

While our investigation of HDA* scaling reveals that there is clearly room 
for improvement when the amount of parallel resources used greatly exceeds 
the minimal required resources (i.e., p > 8xp min ), we have shown that HDA* 
scales reasonably well across a wide range of parallel platforms, including a 
single multicore machine, a commodity cluster, and two high-performance 
computing clusters. Combined with its simplicity and portability, this sug- 
gests that HDA* should be considered a default, baseline algorithm for par- 
allel best-first search. 

This paper focused on the search phase of problem solving. Paralleliza- 
tion of the heuristic construction phase (e.g., computation of the abstraction 
heuristic table [3] and pattern database generation is another area for 
future work. A related avenue for future work is a more effective use of mem- 
ory for heuristic tables. Our current implementation of HDA* uses a single 
process per core. When executed on one or more (shared memory) multicore 
machines, our implementation of HDA* executes as a set of independent 
processes without sharing any memory resources among cores that are on 
the same machine. This means that the memory used for an abstraction 
heuristic in planning or for a pattern database in 24-puzzle is unnecessarily 
replicated n times on an n-core machine. We are currently investigating a 
hybrid, distributed/shared memory implementation of HDA* which elimi- 
nates this inefficiency. One possible approach is to distribute work among 
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machines using hash-based distribution, but within a single machine incorpo- 
rate techniques such as speculative expansion that have been shown to scale 
well on a shared memory environment js)]. 

Finally, although efficiency (with respect to runtime) is an important 
characteristic for parallel search, the ability to solve difficult problems that 
cannot be solved using fewer resources (because A* exhausts RAM, and other 
methods such as IDA* on a single machine take too long) may be the most 
compelling reason to consider large scale parallel search on distributed mem- 
ory clusters. Using up to 2400 cores and 10.5 terabytes of aggregate RAM, 
HDA* was able to compute optimal solutions to larger planning benchmark 
problems than was previously possible on a single machine. The increasing 
pervasiveness of massive "utility computing" resources (such as cloud com- 
puting services) means that further work is needed to develop and evaluate 
algorithms that can scale even further, to tens of thousands of cores and 
petabytes of RAM. Furthermore, a focus on solving problems using utility 
computing services which incur monetary usage costs requires the develop- 
ment of cost-effective utilization of resources. We have recently analyzed an 



iterative resource allocation policy to address this [41 
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